LAMP-TR-056 

CAR-TR-951 

CS-TR-4174 


MDA  9049-6C-1250 
August  2000 


Learning  Algorithms  for  Audio  and  Video 
Processing — Independent  Component  Analysis  and 
Support  Vector  Machine  Based  Approaches 


Yuan  Qi 

Center  for  Automation  Research 
University  of  Maryland 
College  Park,  MD  20742-3275 


Abstract 

In this thesis, we propose two new machine learning schemes, a Subband-based Independent
Component Analysis scheme and a hybrid Independent Component Analysis/Support
Vector Machine scheme, and apply them to the problems of blind acoustic signal separation
and face detection.

Based on a linear model, classical Independent Component Analysis (ICA) provides a
method of representing data as independent components. In contrast to Principal Component
Analysis (PCA), which decorrelates the data based on its covariance matrix, ICA uses
higher-order statistics of the data to minimize the dependence between the components of the
system output. An important application of ICA is blind source separation. However, classical
ICA algorithms do not work well for separation in the presence of noise or when performed
on-line. Inspired by the psychoacoustic discovery that humans perceive and process acoustic
signals in different frequency bands independently, we propose a new algorithm, subband-based
ICA, that integrates ICA with time-frequency analysis to separate mixed signals. In
subband-based ICA, the separations are performed in parallel in several frequency bands. Wavelet
decomposition and best basis selection in wavelet/DCT packets can be incorporated into this
algorithm. Subband-based ICA is computationally fast, robust to noise, and works well in an
on-line version when other ICA algorithms fail. The virtually increased signal-to-noise ratio
in those frequency bands where the separations are actually performed, and the fact that
subband signals, i.e., wavelet coefficients, have more peaked and heavier-tailed distributions than the
original signals, both contribute to the success of subband-based ICA. Experimental results on
separating noisy speech mixtures and musical signal mixtures demonstrate its effectiveness.

In addition to separating mixed signals, ICA can also be used as a feature extractor. As
argued by many researchers in the neural research area, a principle of sensory information
processing in the brain is redundancy reduction. The ICA representation of the data follows
this principle. Also, from a signal processing viewpoint, ICA provides a nice way to cluster
independent signals and hence leads to a better representation of signals than PCA.

Motivated by the feature extraction capability of ICA, we propose a new hybrid
unsupervised/supervised learning scheme that integrates Independent Component Analysis with the
Support Vector Machine (SVM) approach and apply this new learning scheme to the face
detection problem. SVM is a new powerful machine learning algorithm which is rooted in
statistical learning theory. As an approximate implementation of the Structural Risk
Minimization (SRM) Principle proposed in statistical learning theory, SVM tends to have good
generalization performance. One common characteristic shared by ICA and SVM is sparsity.
The ICA output is sparse, and the support vectors whose linear combination comprises the
trained SVM are also sparse. Thus integrating ICA with SVM yields a new hybrid hierarchical
sparse learning scheme.

Specifically, for the face detection problem we use ICA in two different ways to extract
low-level features from a window sliding over an image, and then apply SVM at a high level to
classify the extracted ICA features as a face or not. Experimental results show that using the
first method to extract ICA features and applying SVM for classification effectively improves the
detection system performance, compared with applying SVM directly to the original image data.

Finally, as a general learning scheme, hybrid ICA/SVM can be applied to other pattern
recognition problems as well as to face detection.


Chapter 1


Introduction 


In this thesis, we propose two new machine learning schemes, a Subband-based Independent
Component Analysis scheme and a hybrid Independent Component Analysis/Support Vector
Machine scheme, and apply them to the problems of blind acoustic signal separation and face
detection. This introduction provides brief summaries of these two learning schemes and an
outline of the dissertation. In Section 1.1 we give a short background review of Independent
Component Analysis and Support Vector Machines. The motivations for our two new learning
schemes and concise descriptions of them are given in Section 1.2, and the organization of the
dissertation is outlined in Section 1.3.

1.1  Background  Review 

1.1.1  Independent  Component  Analysis 

A common problem in statistics, signal processing, and neural network research is how to
design an appropriate representation for multivariate data. Based on a linear model, Independent
Component Analysis (ICA) offers a method of representing the data as independent components. In
contrast to Principal Component Analysis (PCA), which decorrelates the data based on its covariance
matrix, ICA uses higher-order statistics of the data to minimize the dependence between the
components of the representation. Such a representation seems to capture the essential
structure of the data in many problems. As a result, ICA is being used in an increasing number of
applications, such as speech enhancement and recognition, telecommunication, biomedical
signal analysis, and image denoising and recognition [3, 12, 39, 8, 42, 36, 21]. In these applications,
the problems to which ICA is applied include blind source separation, blind deconvolution, and
feature extraction.

In the blind source separation problem, ICA is applied to recover independent unknown
sources given only sensor observations that are unknown linear mixtures of the unobserved
sources and noise. ICA has been successfully applied to separate acoustic signals,
electroencephalographic (EEG) signals, and magnetoencephalographic (MEG) signals. ICA has
also been used in blind equalization and in Code Division Multiple Access (CDMA) systems in
communications. For the blind deconvolution problem, if we transform the data to the
frequency domain, the problem becomes the same as the blind separation problem, so that it can
be tackled by ICA too.
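
The reduction of blind deconvolution to blind separation can be made concrete with a short worked equation. The convolutive model and the symbols $\mathbf{A}(\tau)$, $\mathbf{X}(\omega)$, $\mathbf{A}(\omega)$, and $\mathbf{S}(\omega)$ are notation assumed here for illustration and are not taken from the thesis:

```latex
\mathbf{x}(t) = \sum_{\tau} \mathbf{A}(\tau)\,\mathbf{s}(t-\tau)
\quad\Longrightarrow\quad
\mathbf{X}(\omega) = \mathbf{A}(\omega)\,\mathbf{S}(\omega)
```

By the convolution theorem, the relation at each frequency $\omega$ is an instantaneous linear mixture, which has the same form as the blind source separation model.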

In the feature extraction problem, ICA aims to find an independent basis or representation
coefficients for the data. In [7, 6, 5, 20], Barlow et al. argued that a principle of sensory
information processing in the brain is redundancy reduction. The ICA representation of the
data follows this principle. In [9] Bell and Sejnowski point out that the independent components
of natural scenes are edge filters. In [48], Olshausen and Field show, under the noise-free
assumption, an equivalence between an ICA algorithm and sparse coding, another method of
implementing the redundancy reduction principle. In [63], Hateren et al. report a detailed
comparison between ICA features and the properties of simple cells in the macaque primary
visual cortex, and find good matches to most of the parameters. Besides these discoveries of
psychological and neural research, from the signal processing viewpoint, ICA provides a nice
way of clustering independent signals and hence leads to a better representation of signals than
classical Principal Component Analysis. This also justifies the use of ICA for feature extraction.
ICA feature extraction has been applied to face recognition and image denoising, and satisfactory
results have been obtained.

1.1.2  Support  Vector  Machines  and  Statistical  Learning  Theory 

The Support Vector Machine (SVM) is a powerful machine learning algorithm, which is rooted
in statistical learning theory. According to the Structural Risk Minimization (SRM) Principle
in statistical learning theory [65], the error rate of a learning machine on test data is bounded
by the sum of the training error rate and a term that depends on the Vapnik-Chervonenkis
(VC) dimension and indicates the complexity of the model. By first nonlinearly mapping the
input data into a high-dimensional feature space, and then constructing a hyperplane as the
decision surface in that space which leaves the maximal margin between positive and negative
examples, SVM approximately implements the SRM Principle. Thus the training error rate and
the model complexity can be minimized at the same time by SVM. Therefore, in theory, SVM
tends to have good generalization performance. Many applications have also demonstrated the
good generalization performance of SVM, including isolated handwritten digit recognition [58],
object recognition [10], speaker identification [57], and face detection [49].
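
For reference, one commonly quoted form of the bound behind the SRM Principle is sketched below. The symbols are assumptions introduced here rather than the thesis's notation: $R$ is the expected error rate, $R_{\mathrm{emp}}$ the training error rate, $h$ the VC dimension, $l$ the number of training examples, and the inequality holds with probability at least $1-\delta$.

```latex
R(\alpha) \;\le\; R_{\mathrm{emp}}(\alpha)
  + \sqrt{\frac{h\left(\ln\frac{2l}{h} + 1\right) - \ln\frac{\delta}{4}}{l}}
```

The second term grows with $h$, so at a fixed training error a machine of lower capacity has a tighter guarantee. This is the trade-off that a maximal-margin hyperplane is meant to control.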

In  addition  to  good  generalization  performance,  SVM  has  many  other  nice  properties: 

• By reformulating the primal quadratic programming (QP) problem encountered in SVM
training into its dual problem and using a suitable inner-product kernel, SVM controls
the model complexity independently of the dimensionality of the feature space. In fact,
infinite-dimensional feature spaces are allowed in SVM.

• Moreover, the convex cost function in the QP problem guarantees that SVM will find a
globally optimal solution, while many other learning algorithms can become trapped in
local extrema.

•  By  solving  the  QP  problem  during  the  training  phase,  SVM  automatically  tunes  all  the 
parameters  in  a  learning  scheme. 

•  The  support  vectors,  whose  linear  combination  comprises  the  trained  SVM,  are  usually 
sparse.  By  reformulating  SVM  in  the  framework  of  regularization  theory,  Girosi  [29]  shows 
an  equivalence  between  SVMs  and  a  Sparse  Approximation  (SA)  scheme  that  resembles 
the  Basis  Pursuit  De-Noising  algorithm  [14].  This  reveals  the  relationship  between  SVM 
and  other  known  techniques. 
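
The properties listed above can be read off the standard soft-margin dual problem, sketched here in assumed notation that is not the thesis's own: $d_i \in \{-1, +1\}$ are the class labels, $K$ is the inner-product kernel, $C$ is the regularization constant, and $\alpha_i$ are the Lagrange multipliers.

```latex
\max_{\alpha}\; \sum_{i=1}^{l} \alpha_i
  - \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}
    \alpha_i \alpha_j d_i d_j K(\mathbf{x}_i, \mathbf{x}_j)
\quad \text{subject to} \quad
0 \le \alpha_i \le C, \qquad \sum_{i=1}^{l} \alpha_i d_i = 0,
\\[4pt]
f(\mathbf{x}) = \operatorname{sign}\Big(
  \sum_{i:\,\alpha_i > 0} \alpha_i d_i K(\mathbf{x}_i, \mathbf{x}) + b \Big)
```

The feature space enters only through $K$, the objective is concave in $\alpha$, and only the training points with $\alpha_i > 0$ (the support vectors) appear in the decision function $f$.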

1.2  Motivation  and  Contributions 

Motivated by discoveries in mammalian acoustic and visual systems, we propose two new
learning schemes for acoustic and visual signal processing, which are briefly described in the
following sections.

1.2.1 Integration of ICA and Time-Frequency Analysis

Though classical ICA algorithms have been applied to address the problem of blind source
separation, they do not work well in the presence of noise or when performed on-line. Inspired
by the psychoacoustic discovery that humans perceive and process acoustic signals in different
frequency bands independently [1, 43], we propose a new algorithm, subband-based ICA, that
integrates ICA with time-frequency analysis to separate mixed signals. In subband-based ICA,
the separations are performed in parallel in several frequency bands. Wavelet decomposition and
best basis selection in wavelet/DCT packets can be incorporated into this algorithm.
Subband-based ICA is computationally fast, robust to noise, and works well in an on-line version when
other ICA algorithms fail. The virtually increased signal-to-noise ratio in those frequency
bands, the fact that subband signals, i.e., wavelet coefficients, have more peaked and heavier-tailed
distributions than the original signals, and the adaptation to the properties of the signal and
noise through the incorporation of a best basis selection algorithm, all contribute to the success of
subband-based ICA. Experimental results on separating noisy speech mixtures and musical
signal mixtures demonstrate its effectiveness.

1.2.2  Face  Detection  Based  on  the  Hybrid  ICA/SVM  Learning  Scheme 

Motivated by the feature extraction capability of ICA as mentioned in Section 1.1.1, we propose
a new hybrid unsupervised/supervised learning scheme that integrates Independent Component
Analysis with the Support Vector Machine and we apply this new learning scheme to the face
detection problem. Specifically, we use ICA in two different ways to extract low-level features
from a window sliding over an image, and then apply SVM at a high level to decide whether the
extracted ICA features represent a face. Experimental results show that using the first method
of extracting ICA features effectively improves detection system performance, compared with
applying SVM directly to the original image data.

An  interesting  comment  about  the  hybrid  learning  scheme  is  that  ICA  and  SVM  share  a 
common  characteristic,  sparsity.  The  ICA  output  is  sparse,  and  the  support  vectors  in  SVM 
are  also  sparse.  Hence  the  hybrid  learning  scheme  has  a  hierarchical  sparse  architecture. 

Furthermore,  as  a  general  learning  scheme,  hybrid  ICA/SVM  can  be  applied  to  pattern 
recognition  problems  other  than  face  detection. 

1.3  Thesis  Outline 

The  rest  of  the  thesis  is  organized  as  follows.  In  Chapter  2,  we  introduce  the  classical  ICA 
algorithm,  propose  the  new  subband-based  ICA,  and  apply  the  new  algorithm  to  separating 
mixed  acoustic  signals.  In  Chapter  3,  we  present  a  review  and  discussion  of  SVM.  Finally, 
in  Chapter  4,  we  propose  the  hybrid  ICA/SVM  learning  scheme,  and  apply  it  to  the  face 
detection problem.

Chapter 2


Subband-based  Independent  Component  Analysis 


2.1  Introduction 

Independent Component Analysis (ICA) can recover independent sources given only sensor
observations that are unknown linear mixtures of the unobserved source signals and noise. In
contrast to Principal Component Analysis, which decorrelates signals based on the covariance
matrix, ICA uses higher-order statistics of the signals to find independent components. ICA
has many applications in speech enhancement and recognition, telecommunication, biomedical
signal analysis, and image denoising and recognition [3, 12, 39, 8, 42, 36]. However, classical
ICA algorithms do not work well on-line or in the presence of noise. Inspired by the
psychoacoustic discoveries connecting auditory perception and wavelet theory, we propose a new ICA
algorithm, subband-based ICA, to separate independent signals. Experimental results on
separating mixed acoustic signals demonstrate its robustness to noise and its high efficiency when
performed on-line.

2.2  Classical  ICA  System  Model  and  Learning  Rule 

While several nonlinear ICA algorithms have been proposed [37, 40], most of the contributions
to the ICA literature are based on the linear input mixture model, which is defined as

\[ \mathbf{x}(t) = \mathbf{A}\,\mathbf{s}(t) + \mathbf{b}(t), \]

where $\mathbf{s}(t) = [s_1(t), s_2(t), \ldots, s_n(t)]^T$ is the unknown source signal vector at discrete time $t$,
$\mathbf{x}(t) = [x_1(t), x_2(t), \ldots, x_n(t)]^T$ is the observation signal vector, $\mathbf{A}$ is a full-rank $n \times n$ mixing
matrix, and $\mathbf{b}(t)$ is noise. The components of the vector $\mathbf{s}(t)$, i.e., $s_1(t), s_2(t), \ldots, s_n(t)$, come
from $n$ independent sources. Unlike factor analysis addressed by an EM algorithm [28], which
assumes that $\mathbf{b}(t)$ is normally distributed with a diagonal covariance matrix and $\mathbf{s}(t)$ is also
normally distributed, ICA algorithms are derived on the assumption of noise-free measurements.
In practice, many ICA algorithms do not work well on noisy mixtures.

Given the mixture model, the aim of ICA is to recover the original source signal $\mathbf{s}(t)$. To
this end, the following simple separation model is used, corresponding to the above linear
mixture model:

\[ \mathbf{y}(t) = \mathbf{W}\,\mathbf{x}(t), \]

where $\mathbf{y}(t) = [y_1(t), y_2(t), \ldots, y_n(t)]^T$ is an estimate of $\mathbf{s}(t)$ and $\mathbf{W}$ is the unmixing matrix, i.e.,
an estimate of the inverse of $\mathbf{A}$.
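
As a minimal illustration of the two models, the sketch below mixes a few source samples with a known matrix and unmixes them with the ideal W, the inverse of A, in the noise-free case. The 2 x 2 matrix, the sample values, and the helper names are hypothetical choices made only for this example. In ICA itself A is unknown and W has to be learned.

```python
def matvec(M, v):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def inverse_2x2(M):
    """Closed-form inverse of a 2 x 2 matrix; raises if M is singular."""
    (a, b), (c, d) = M
    det = a * d - b * c
    if det == 0:
        raise ValueError("mixing matrix must be full rank")
    return [[d / det, -b / det], [-c / det, a / det]]


# Hypothetical mixing matrix and source samples, chosen only for illustration.
A = [[1.0, 0.5],
     [0.25, 1.0]]
sources = [[1.0, -2.0], [0.5, 0.0], [-1.5, 3.0]]   # s(t) for t = 0, 1, 2

# Noise-free mixing model: x(t) = A s(t)
mixtures = [matvec(A, s) for s in sources]

# Separation model: y(t) = W x(t), here with the ideal choice W = A^{-1}
W = inverse_2x2(A)
estimates = [matvec(W, x) for x in mixtures]
```

With noise b(t) added to the mixtures, the same W would return the sources plus the filtered noise W b(t), which is one way to see why noisy mixtures are harder.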

To obtain the learning rule for the unmixing matrix $\mathbf{W}$, we use the natural gradient [4] to
minimize the Kullback-Leibler divergence between the source signal vector $\mathbf{s}$ and its estimate
$\mathbf{y}$, i.e.,

\[ D(f_y \,\|\, f_s) = \int f_y(\mathbf{y}) \log \frac{f_y(\mathbf{y})}{f_s(\mathbf{y})} \, d\mathbf{y}, \]

where $f_y$ and $f_s$ are the probability density functions (pdfs) of $\mathbf{y}$ and $\mathbf{s}$. The pdfs are
approximated by truncation of the Gram-Charlier expansion. The following learning rule is then
obtained:

\[ \mathbf{Q} = \mathbf{I} - \mathbf{g}(\mathbf{y}(n))\,\mathbf{y}^T(n), \tag{2.1} \]

\[ \mathbf{W}(n+1) = \mathbf{W}(n) + \eta(n)\,\mathbf{Q}\,\mathbf{W}(n), \tag{2.2} \]

where $\mathbf{I}$ is the identity matrix, $\eta(n)$ is the learning rate, and $\mathbf{g}(\mathbf{y}) = [g(y_1), g(y_2), \ldots, g(y_n)]^T$ is a
nonlinear function [31],

\[ g(z) = \frac{1}{2}z^{5} + \frac{2}{3}z^{7} + \frac{15}{2}z^{9} + \frac{2}{15}z^{11} - \frac{112}{3}z^{13} + 128z^{15} - \frac{512}{3}z^{17}. \tag{2.3} \]


Since natural signals are usually super-Gaussian, we can also simply use $2\tanh(z)$ as the
nonlinear function $g(z)$ when applying learning rules (2.1) and (2.2) to separate speech or music
signals [39].

Furthermore, based on [2] we can derive a nonholonomic version of the learning rule that is
suitable for on-line signal separation. In the nonholonomic version, the diagonal elements of $\mathbf{Q}$
are set to zero.
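
The update can be sketched in a few lines of plain Python. This is an illustrative reading of rules (2.1) and (2.2) with the 2 tanh nonlinearity, not the author's implementation. The function names `ica_step` and `matmul`, the step size, and the sample input are assumptions made for the example.

```python
import math


def g(z):
    """Nonlinearity for super-Gaussian sources: g(z) = 2 tanh(z)."""
    return 2.0 * math.tanh(z)


def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    inner = len(B)
    cols = len(B[0])
    return [[sum(row[k] * B[k][j] for k in range(inner)) for j in range(cols)]
            for row in A]


def ica_step(W, x, eta, nonholonomic=False):
    """One on-line update of the unmixing matrix W for a single sample x."""
    n = len(W)
    y = [sum(W[i][j] * x[j] for j in range(n)) for i in range(n)]  # y = W x
    gy = [g(v) for v in y]
    # Rule (2.1): Q = I - g(y) y^T
    Q = [[(1.0 if i == j else 0.0) - gy[i] * y[j] for j in range(n)]
         for i in range(n)]
    if nonholonomic:
        # Nonholonomic variant: the diagonal elements of Q are set to zero.
        for i in range(n):
            Q[i][i] = 0.0
    QW = matmul(Q, W)
    # Rule (2.2): W(n + 1) = W(n) + eta * Q * W(n)
    return [[W[i][j] + eta * QW[i][j] for j in range(n)] for i in range(n)]


# Illustrative call: start from the identity and process one sample.
W0 = [[1.0, 0.0], [0.0, 1.0]]
W1 = ica_step(W0, [1.0, 0.0], eta=0.1)
W1_nonholonomic = ica_step(W0, [1.0, 0.0], eta=0.1, nonholonomic=True)
```

In this call the off-diagonal entries of Q happen to be zero, so the nonholonomic step leaves W unchanged while the ordinary step rescales its diagonal. That reflects the general behavior: the nonholonomic rule does not try to fix the scale of each output.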


2.3  Subband-based  ICA 

Many psychoacoustic experiments have shown that humans perceive and process acoustic signals
in different frequency bands independently [1, 43]. Inspired by these discoveries, we propose a
new algorithm, namely, subband-based ICA, that integrates ICA with time-frequency analysis
to separate mixed signals. Subband-based ICA and the early auditory models are compared in
Figure 2.1. The new algorithm can accomplish the separation task successfully in the presence
of strong noise, or when working in an on-line version.

The outline of the algorithm is as follows:

1. First, each component $x_j(n)$ of the observation $\mathbf{x}(n)$, where $1 \le j \le n$, is filtered into
subband signals.

Though digital filter banks have been built to mimic the subbanding function of the
cochlea [68], for simplicity and to provide the linearity required by ICA, the orthogonal
Daubechies wavelet packet decomposition [19] is used instead of the cochlear filter bank:

\[ x_j^k(n) = \langle \mathbf{x}_j, \mathbf{e}_k \rangle, \tag{2.4} \]

where $\mathbf{x}_j^T = (x_j(n), x_j(n-1), \ldots, x_j(n-N+1))$, $\mathbf{e}_k^T = (e_k(1), e_k(2), \ldots, e_k(N))$ is a
vector of coefficients determined by the $k$-th band Daubechies wavelet filter, and $N$ is a
window size.

2. The averaged powers of the decomposed signals in every band are computed and sorted
by a fast sorting algorithm, for example heap sort.

3. Then the nonholonomic learning rule (i.e., (2.1) and (2.2) with the diagonal elements of
$\mathbf{Q}$ set to zero) is applied only to the bands that have the strongest power, for example,
to the strongest fourth of all the signal bands, for the following reasons:

• If the noise is broad-band, the signal-to-noise ratio (SNR) will be larger for those
bands which have the strongest power.

• If the noise is limited to narrow bands, many signal bands will be noiseless, which
means that good separation results can be obtained in those bands.

We denote the demixing matrix in the selected $k$-th band by

\[ \mathbf{W}^k = \begin{pmatrix} \mathbf{w}_1^k \\ \vdots \\ \mathbf{w}_n^k \end{pmatrix}, \]

where row $\mathbf{w}_j^k$, $1 \le j \le n$, is used to obtain the $j$-th component, $y_j^k(t)$, of the estimated
source signal $\mathbf{y}(t)$ in the $k$-th band.

Figure 2.1: Subband-based ICA and early auditory models

4. Noise is reduced using a soft thresholding algorithm [22] applied to the subband-decomposed
signals.

5. To recover the estimated source signal $\mathbf{y}(t)$, we have two methods:

a. First recover the overall unmixing matrix $\mathbf{W}$ from the unmixing matrices associated
with different subbands, and then recover $\mathbf{y}(t)$ from $\mathbf{W}\mathbf{x}(t)$. Competitive
learning [31] is applied to cluster the rows of the unmixing matrices obtained in different
subbands. The overall unmixing matrix $\mathbf{W}$ consists of $n$ clustered rows.

b. Recover $\mathbf{y}(t)$ directly from the $y_j^k(t)$, $1 \le j \le n$, by the wavelet packet reconstruction
algorithm.

Depending on the practical situation, we can choose (a) or (b) to get the best result.
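
Steps 1, 2, and 4 of the outline above can be sketched as follows. This is a simplified illustration rather than the thesis's code: it uses the Haar filter pair (the shortest Daubechies wavelet) instead of whatever filter length the author used, it selects the strongest bands with a heap via `heapq.nlargest`, and all function names, the test signal, and the threshold value are hypothetical.

```python
import heapq
import math


def haar_split(x):
    """One Haar analysis step: returns (low-pass half, high-pass half)."""
    s = 1.0 / math.sqrt(2.0)
    half = len(x) // 2
    low = [(x[2 * i] + x[2 * i + 1]) * s for i in range(half)]
    high = [(x[2 * i] - x[2 * i + 1]) * s for i in range(half)]
    return low, high


def wavelet_packet_bands(x, levels):
    """Full wavelet packet tree: split every band at every level (step 1)."""
    bands = [list(x)]
    for _ in range(levels):
        split = []
        for band in bands:
            split.extend(haar_split(band))
        bands = split
    return bands  # 2**levels bands, in tree order rather than frequency order


def strongest_bands(bands, fraction=0.25):
    """Indices of the bands with the largest average power (step 2)."""
    powers = [sum(c * c for c in band) / len(band) for band in bands]
    keep = max(1, int(len(bands) * fraction))
    return heapq.nlargest(keep, range(len(bands)), key=lambda k: powers[k])


def soft_threshold(coeffs, t):
    """Shrink each coefficient toward zero by t (step 4)."""
    return [math.copysign(max(abs(c) - t, 0.0), c) for c in coeffs]


# Illustrative input: a purely alternating signal of length 8.
x1 = [1.0, -1.0] * 4
bands = wavelet_packet_bands(x1, levels=2)
selected = strongest_bands(bands)
denoised = soft_threshold([3.0, -0.5, 1.0], t=1.0)
```

Because every operation in the decomposition is linear, the subband signals of a mixture are the same mixture of the sources' subband signals. That is why the learning rule can be run separately inside each selected band.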


Note that besides the virtually increased SNR in those frequency bands where we applied
the ICA learning rules, the fact that subband signals, i.e., wavelet coefficients, have more peaked
and heavier-tailed distributions than the original signals also greatly contributes to the success
of subband-based ICA when it is applied to noisy mixtures or performed on-line. Indeed,
an assumption underlying the ICA learning rule is that the source signals are non-Gaussian.
However, the presence of noise makes the signal mixture more like a Gaussian. Also, even
with a little noise, in a short time period, the mixture signal distribution may come close
to a Gaussian because of nonstationarity. Therefore classical ICA algorithms do not work
well in very noisy situations or when performed on-line. On the other hand, wavelet
coefficients of signals are much sparser than the original signals, which leads to a more peaked and
heavy-tailed distribution. In fact, wavelet coefficients have been modeled by a typical
super-Gaussian distribution, the Laplace distribution, in wavelet denoising and coding research [61].
Our simulations on speech and music signals also support this point. By applying the learning
rule to the super-Gaussian subband signals, subband-based ICA converges to the unmixing
matrix quickly even in the case of noisy mixtures or when performed on-line.
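
The claim that noise pushes a signal toward a Gaussian can be quantified with a short calculation for the scalar case. The notation is assumed here for illustration: $\kappa(\cdot)$ denotes excess kurtosis (zero for a Gaussian), $\operatorname{cum}_4$ the fourth cumulant, $s$ a zero-mean source with variance $\sigma_s^2$, and $b$ independent Gaussian noise with variance $\sigma_b^2$. Fourth cumulants of independent variables add, and the fourth cumulant of a Gaussian is zero, so

```latex
\kappa(s + b)
  = \frac{\operatorname{cum}_4(s) + \operatorname{cum}_4(b)}{(\sigma_s^2 + \sigma_b^2)^2}
  = \frac{\kappa(s)\,\sigma_s^4}{(\sigma_s^2 + \sigma_b^2)^2}
```

At an SNR of 0 dB ($\sigma_s^2 = \sigma_b^2$) the excess kurtosis falls to one quarter of its noise-free value. A Laplace-distributed source with $\kappa(s) = 3$, for example, yields only $0.75$.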


2.4  Adaptive  Basis  Selection  in  Wavelet/DCT  Packets 

Subband-based ICA enhances the separation capability by decomposition of the signal into
different frequency bands. But the problem of designing the filter bank remains. For example,
it is desirable that we do not split the signal into two bands at the frequency where the energy
of the signal is concentrated, because otherwise we might segment one or several continuous
signal streams in the time-frequency plane into two different bands, which could affect the
performance of ICA in each band. So, depending on different signal properties, we can design
different filter banks to improve the performance of subband-based ICA.

To address this problem, we incorporated the adaptive basis selection algorithm, proposed
by Coifman et al. [15], into the subband-based ICA algorithm.

As in the procedure described in Section 2.3, we have the following steps:

1. First we choose Shannon entropy as the cost function and apply the adaptive basis
selection algorithm using wavelet or DCT packets (see the details in [15]) to the sum
of the mixed signals to get the best basis.

2. Then we project each mixed signal onto the best basis.

3. The learning rule is applied only to those of the projected signals that have the strongest
normalized power. Noise is reduced by thresholding if necessary.

4. Competitive learning is used to group the rows of the unmixing matrices obtained from
different bases to get the overall unmixing matrix $\mathbf{W}$.

The best basis selection algorithm in effect accomplishes the task of adaptively selecting
filter banks based on the properties of the signal, which makes subband-based ICA more robust
against noise.
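
The entropy-driven search in step 1 can be sketched as a bottom-up comparison between each node of the packet tree and its two children. The code below is an illustrative reading of that idea using Haar packets and the additive Shannon-entropy cost. It is not taken from [15] or from the thesis, and the names `entropy_cost` and `best_basis` and the "L"/"H" path labels are hypothetical.

```python
import math


def haar_split(x):
    """One Haar analysis step: returns (low-pass half, high-pass half)."""
    s = 1.0 / math.sqrt(2.0)
    half = len(x) // 2
    low = [(x[2 * i] + x[2 * i + 1]) * s for i in range(half)]
    high = [(x[2 * i] - x[2 * i + 1]) * s for i in range(half)]
    return low, high


def entropy_cost(coeffs):
    """Additive Shannon-entropy cost: -sum c^2 log c^2, with 0 log 0 = 0."""
    energies = (c * c for c in coeffs)
    return -sum(e * math.log(e) for e in energies if e > 0.0)


def best_basis(x, max_level):
    """Return (cost, leaves); each leaf is a (path, coefficients) pair."""
    energy = sum(v * v for v in x)
    scale = 1.0 / math.sqrt(energy) if energy > 0.0 else 1.0
    root = [v * scale for v in x]  # normalize the signal to unit energy

    def search(node, path, level):
        own = entropy_cost(node)
        if level == max_level or len(node) < 2:
            return own, [(path, node)]
        low, high = haar_split(node)
        cost_low, leaves_low = search(low, path + "L", level + 1)
        cost_high, leaves_high = search(high, path + "H", level + 1)
        # Keep the parent unless splitting strictly lowers the total cost.
        if own <= cost_low + cost_high:
            return own, [(path, node)]
        return cost_low + cost_high, leaves_low + leaves_high

    return search(root, "", 0)


# A constant signal is best described by splitting down the low-pass branch;
# a single impulse is already maximally concentrated and is left unsplit.
cost_const, leaves_const = best_basis([1.0, 1.0, 1.0, 1.0], max_level=2)
cost_impulse, leaves_impulse = best_basis([1.0, 0.0, 0.0, 0.0], max_level=2)
```

In the procedure above, the input to the search would be the sum of the mixed signals, and each mixture would then be expanded in the set of leaves that the search returns.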


2.5  Experimental  Results 


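First let us introduce a performance index $E$, which is defined as in [4]:

\[ E = \sum_{i=1}^{n} \left( \sum_{j=1}^{n} \frac{|p_{ij}|}{\max_k |p_{ik}|} - 1 \right) + \sum_{j=1}^{n} \left( \sum_{i=1}^{n} \frac{|p_{ij}|}{\max_k |p_{kj}|} - 1 \right), \]

where $\mathbf{P} = \{p_{ij}\} = \mathbf{W}\mathbf{A}$. The smaller the index is, the better $\mathbf{P}$ approximates a permutation
matrix which has only one nonzero element in each row and each column, and the better the
separation is.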
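
A direct transcription of this index into plain Python is sketched below. The function name and the example matrices are illustrative, and the sketch assumes that P has no all-zero row or column, since the index is undefined in that case.

```python
def performance_index(P):
    """Performance index E of P = W A (zero for a scaled permutation)."""
    n = len(P)
    abs_rows = [[abs(v) for v in row] for row in P]
    abs_cols = [[abs_rows[i][j] for i in range(n)] for j in range(n)]
    # First sum: each row's entries relative to that row's largest entry.
    row_term = sum(sum(row) / max(row) - 1.0 for row in abs_rows)
    # Second sum: each column's entries relative to that column's largest.
    col_term = sum(sum(col) / max(col) - 1.0 for col in abs_cols)
    return row_term + col_term


# A scaled, sign-flipped permutation counts as a perfect separation.
perfect = performance_index([[0.0, 3.0], [-2.0, 0.0]])
# Residual cross-talk between the two outputs raises the index.
leaky = performance_index([[1.0, 0.5], [0.5, 1.0]])
```

The index ignores the order, sign, and scale of the recovered sources, which matches the ambiguities that ICA cannot resolve.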

In the following paragraphs, we report our experimental results in both batch and on-line
modes.

In batch mode, we separated two mixtures of two speech signals, randomly selected from
the TIMIT speech library, to which strong white noise had been added. These speech signals were
sampled at 8 kHz. The average SNR of the mixtures was 0.51 dB. From the mixtures it was hard to
understand any word of the speech. Then subband-based ICA was applied to separate the
mixture signals. The performance index $E$ of this separation was 0.08 and the SNR increased
to 5.64 dB. The separated speech signals were understandable, though still noisy.

Next, still in batch mode, we tested our algorithm on two mixtures of strong white noise and
the test data street.wav and beet.wav, which were used at the ICA 1999 conference [38]. The
power of the noise was the same as the average mixed signal power, i.e., the average SNR was 0.0
dB. Despite the low SNR, subband-based ICA with adaptive basis selection was successful
in the separation. For purposes of comparison, we also tested the Fast ICA algorithm [34] and
the Extended Infomax algorithm [39] on those noisy mixtures. The codes for Fast ICA and
Extended Infomax were downloaded from [33] and [41] respectively. For Extended Infomax we
modified the learning rate to try to get the best performance on our test data. The separation
results are shown in Table 2.1.

Approach             Index E    Average SNR of the separated signals
Subband-based ICA    0.051       4.31 dB
Fast ICA             0.124      -1.63 dB
Extended Infomax     0.118      -1.38 dB

Table 2.1: Simulation of different ICA algorithms. The average SNR of the mixed signal is
0.0 dB.

From Table 2.1, we can see that subband-based ICA is robust against noise. The
waveforms in the separation obtained using subband-based ICA are shown in Figure 2.2.


Figure  2.2:  Separation  in  the  presence  of  strong  white  noise 


Our  on-line  separation  experiments  were  as  follows. 

First, we separated on-line two mixtures of a violin melody and a segment of some pop
music. These musical signals were sampled at 8 kHz. We used a modified Extended Infomax
algorithm [39], nonholonomic ICA without wavelet decomposition, and subband-based ICA.
We modified the Extended Infomax algorithm into an on-line version and changed its learning
rate to achieve good performance on our test data. The performance indexes of these three
algorithms are shown in Figure 2.3. From this figure, we can see that subband-based ICA did
the separation successfully, while the other two methods failed.

[Panel A: the curve of the performance index of the Extended Infomax algorithm]

Figure 2.3: Experiment 1: the curves of the performance index E


Also, subband-based ICA was much faster than the other two methods. We used a Sun
Ultra10 with 500 MB of memory to run the Matlab scripts. The times needed to separate the
mixtures are listed in Table 2.2.


Approach                     Separation Time (sec.)
Modified Extended Infomax    61.72
Nonholonomic ICA             86.68
Subband-based ICA            18.05

Table 2.2: Experiment 1: The separation times needed by different ICA algorithms

Second, we tested those on-line algorithms on mixtures of two songs. Those signals were also
sampled at 8 kHz. The same three separation algorithms were tested as before. The curves of
the performance index E are shown in Figure 2.4. Clearly, subband-based ICA is much better
than the other two methods.

In  addition,  the  times  needed  to  separate  the  mixtures  are  listed  in  Table  2.3. 


Approach                     Separation Time (sec.)
Modified Extended Infomax    1582.62
Nonholonomic ICA             563.24
Subband-based ICA            101.78

Table 2.3: Experiment 2: The separation times needed by different ICA algorithms


[Panel A: the curve of the performance index of the Extended Infomax algorithm]

Figure 2.4: Experiment 2: the curves of the performance index E


Third, we tested the on-line algorithms on the mixtures of two other musical signals. After
processing the data, the performance index E of the subband-based ICA converged to 0.0181,
while the performance indexes of the Extended Infomax algorithm and the nonholonomic ICA were
still 3.1894 and 2.1132 respectively. The waveforms of the source signals and separated signals
are shown in Figure 2.5.

2.6  Conclusion 

Inspired  by  our  understanding  of  the  subbanding  strategies  used  in  the  early  auditory  system, 
we  presented  subband-based  ICA,  a  powerful  new  algorithm  for  separating  mixed  signals.  By 
performing  separation  in  several  frequency  bands  where  the  SNR  is  higher  than  in  the  original 
signal  mixtures,  subband-based  ICA  is  robust  against  noise  and  converges  to  the  real  demixing 
matrix  quickly.  Furthermore,  by  incorporating  a  best  basis  selection  algorithm,  it  can  be 
adaptive to the properties of the signal and noise. Finally, the fact that the subband signals, i.e.,
the wavelet coefficients, have more sharply peaked and heavier-tailed distributions than the
original signals also contributes to the success of subband-based ICA. The experimental results
demonstrate its effectiveness.

Also, subband-based ICA is a computationally efficient algorithm because it reduces the
computational complexity by performing separation on the down-sampled signals in several or
even a single frequency band. In the on-line experiments of Section 2.5 (Tables 2.2 and 2.3), it
was roughly 3 to 15 times faster than the other two algorithms.

Furthermore, we can generalize subband-based ICA by replacing the subband decomposition
with some appropriate projection. For example, a nonlinear projection can be used under some
criterion, e.g., maximum likelihood, to derive a nonlinear ICA.

Our  future  work  will  include  using  some  signal  cues,  for  example,  the  pitch  of  acoustic 
signals,  and  available  prior  knowledge,  to  guide  separation.  In  this  way,  we  may  increase  the 
convergence  speed  and  accomplish  the  separation  even  in  cases  where  the  number  of  sensors  is 
less  than  the  number  of  sources.  Some  work  has  been  initiated  in  this  direction. 


Figure 2.5: Separation of two musical signals


Chapter 3


Support  Vector  Machines  for  Pattern  Recognition 


3.1  Introduction 

The  Support  Vector  Machine  (SVM)  is  a  powerful  new  machine  learning  algorithm,  which 
is  rooted  in  statistical  learning  theory  [65].  By  constructing  a  decision  surface  hyperplane 
which  yields  the  maximal  margin  between  positive  and  negative  examples,  SVM  approximately 
implements  the  Structural  Risk  Minimization  (SRM)  Principle.  This  principle  is  based  on  the 
fact  that  the  error  rate  of  a  learning  machine  on  test  data  is  bounded  by  the  sum  of  the  training 
error  rate  and  a  second  term  that  depends  on  the  Vapnik-Chervonenkis  (VC)  dimension,  a  very 
important  concept  presented  in  [65].  SVM  can  minimize  the  training  error  rate  and  the  second 
term  at  the  same  time.  Many  experiments  have  shown  the  good  generalization  performance  of 
SVM  on  problems  such  as  isolated  handwritten  digit  recognition  [58],  object  recognition  [10], 
speaker identification [57], and face detection [49].

In the following sections, we first review the theories related to SVM, including the Empirical
Risk Minimization Principle and the Structural Risk Minimization Principle. We then describe
how the Structural Risk Minimization Principle is approximately implemented by SVM, and
finally summarize and discuss properties of SVM.

3.2  Empirical  Risk  Minimization 

3.2.1  Expected  Risk  and  Empirical  Risk 

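In two-class pattern recognition, the supervised learning task can be formulated as follows:
Given a set of decision functions

\[ \{ f(\mathbf{x}, \lambda) : \lambda \in \Lambda \}, \qquad f(\cdot, \lambda) : \mathbb{R}^N \to \{-1, 1\} \tag{3.1} \]

where $\Lambda$ is a set of abstract parameters, and a set of examples

\[ (\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_l, y_l), \qquad \mathbf{x}_i \in \mathbb{R}^N, \; y_i \in \{-1, 1\} \]

drawn from an unknown distribution $P(\mathbf{x}, y)$, find a function $f(\cdot, \lambda^*)$ that provides the smallest
possible value for the expected risk:

\[ R(\lambda) = \int |f(\mathbf{x}, \lambda) - y| \, P(\mathbf{x}, y) \, d\mathbf{x} \, dy \]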

The functions $f(\cdot, \lambda)$ are called hypotheses, and the set $\{ f(\cdot, \lambda) : \lambda \in \Lambda \}$ is called the hypothesis
space and is denoted by $H$. Thus the expected risk is a measure of how good a hypothesis is
at predicting the correct value $y$ at a point $\mathbf{x}$. The function $f(\cdot, \lambda)$ is called a trained machine,
given a particular choice of $\lambda$ through training. For example, the hypotheses could be Radial
Basis Functions or Multi-layer Perceptrons with a fixed structure. In this case, the parameter
set $\Lambda$ is the set of weights and biases of the networks.

Because the probability distribution $P(\mathbf{x}, y)$ is unknown, it is impossible to compute the
expected risk $R(\lambda)$ directly. So instead of trying to get the exact value of $R(\lambda)$, a statistical
approximation of $R(\lambda)$, called the empirical risk, is computed on the training set as follows:

\[ R_{emp}(\lambda) = \frac{1}{l} \sum_{i=1}^{l} |f(\mathbf{x}_i, \lambda) - y_i| \]
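
As a small illustration, the empirical risk can be computed directly from a trained machine and a labeled sample. The sketch below is only an example: the threshold classifier and the four training points are made up. Note that with labels in {-1, 1} each misclassified point contributes |f - y| = 2, so this quantity is twice the training error rate.

```python
import numpy as np

def empirical_risk(f, X, y):
    """Mean absolute deviation between predictions f(x_i) and labels y_i."""
    predictions = np.array([f(x) for x in X])
    return np.mean(np.abs(predictions - y))

def threshold_classifier(x):
    # Hypothetical trained machine: the sign of the first coordinate.
    return 1 if x[0] > 0 else -1

X = np.array([[1.0], [2.0], [-1.0], [0.5]])
y = np.array([1, 1, -1, -1])

# Only the last point is misclassified: 2 / 4 = 0.5.
risk = empirical_risk(threshold_classifier, X, y)
print(risk)
```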

3.2.2  Uniform  Convergence  and  VC  Dimension 

According to the law of large numbers, the empirical risk $R_{emp}$ converges in probability to the
expected risk $R$. Hence a straightforward idea is to minimize the empirical risk rather than
the expected risk. This idea is called the Empirical Risk Minimization (ERM) Principle. An
assumption in the ERM Principle is that if $R_{emp}$ converges to $R$, the minimum of $R_{emp}$ will
converge to the minimum of $R$ too. If this assumption does not hold, the ERM Principle
does not lead to a correct inference. Fortunately, as shown by Vapnik and Chervonenkis [64],
this assumption holds if and only if convergence in probability of $R_{emp}$ to $R$ is replaced by
uniform convergence in probability. Here, uniform convergence is defined as follows:

\[ \text{for any } \epsilon > 0, \qquad P\Big( \sup_{\lambda \in \Lambda} |R(\lambda) - R_{emp}(\lambda)| > \epsilon \Big) \to 0 \quad \text{as } l \to \infty \]

Vapnik and Chervonenkis also showed that the finiteness of the VC dimension $h$ of the hypothesis
space $H$ is the necessary and sufficient condition for uniform convergence of the ERM. The VC
dimension of the hypothesis space $H$ is defined as follows:

Consider functions that correspond to the two-class pattern recognition case as
defined in (3.1). If a given set of $l$ points can be labeled in all $2^l$ possible ways, and
for each labeling, a member of the set $\{ f(\cdot, \lambda) \}$ can be found which correctly assigns
those labels, we say that the set of points is shattered by that set of functions. The
VC dimension for the set of functions $\{ f(\cdot, \lambda) \}$ is defined as the maximum number
of training points that can be shattered by $\{ f(\cdot, \lambda) \}$. In [11], it is proved that the
VC dimension of the set of oriented hyperplanes in $\mathbb{R}^N$ is $N + 1$.

Thus the VC dimension is a measure of the complexity of $H$, and it is often, but not necessarily,
related to the number of free parameters of $f(\cdot, \lambda)$. For example, the VC dimension of a set of
Radial Basis Functions or Multi-layer Perceptrons is controlled by the number of hidden units.

3.2.3  Risk  Bound 

Using the concept of the VC dimension, Vapnik [65] derives a bound on the deviation of the
empirical risk from the expected risk. That is, with probability $1 - \eta$ where $0 < \eta < 1$, the
following inequality holds:

\[ R(\lambda) \le R_{emp}(\lambda) + \sqrt{\frac{h \left( \ln \frac{2l}{h} + 1 \right) - \ln \frac{\eta}{4}}{l}} \tag{3.2} \]

where $h$ is the VC dimension of $f(\cdot, \lambda)$, the right hand side of (3.2) is called the "risk bound",
and the second term on the right side is called the "VC confidence". This bound is independent
of $P(\mathbf{x}, y)$. Clearly, in order to achieve small expected risk, which means good generalization
performance, both the empirical risk and the VC confidence have to be small. Because the VC
confidence is an increasing function of the VC dimension $h$ and $R_{emp}$ is usually a decreasing
function of $h$, there is a tradeoff between these two terms when choosing the value of $h$. How
to choose an appropriate value for $h$ is a difficult but important problem.

3.3  Structural  Risk  Minimization 

The bound (3.2) suggests a new induction principle, Structural Risk Minimization (SRM).

The SRM Principle of Vapnik [64] aims to solve the problem of choosing an appropriate
VC dimension. Note that while the VC confidence depends on the VC dimension $h$ of the
given class of functions, the empirical risk $R_{emp}$ depends on the particular function chosen in
training. To minimize $h$ and $R_{emp}$ at the same time, Vapnik constructs a nested structure of
hypothesis spaces

\[ H_1 \subset H_2 \subset \cdots \subset H_n \subset \cdots \]

with the property that $h(n) \le h(n+1)$, where $h(n)$ is the VC dimension of the set $H_n$ and can be
computed, or has an upper bound. For each set $H_n$ the goal of the training is simply to minimize
$R_{emp}$. Then the trained machine whose risk bound, i.e., the sum of $R_{emp}$ and the VC confidence,
is minimal among all trained machines is chosen as the final learning machine. SVM approximately
implements the SRM Principle so that the VC dimension and $R_{emp}$ are minimized at the same time.
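
The selection step of SRM can be sketched in a few lines of Python. This is an illustration only: it assumes the commonly quoted form of the VC confidence, sqrt((h (ln(2l/h) + 1) - ln(eta/4)) / l), and the candidate pairs of VC dimension and empirical risk are hypothetical numbers, not results from this report.

```python
import math

def vc_confidence(h, l, eta=0.05):
    """VC confidence term for VC dimension h and l training examples (assumed form)."""
    return math.sqrt((h * (math.log(2.0 * l / h) + 1.0) - math.log(eta / 4.0)) / l)

def risk_bound(r_emp, h, l, eta=0.05):
    """Upper bound on the expected risk: empirical risk plus VC confidence."""
    return r_emp + vc_confidence(h, l, eta)

# Hypothetical nested structure H_1 < H_2 < ...: (VC dimension, empirical risk
# of the best machine trained in that set). Richer sets fit the data better.
candidates = [(5, 0.30), (20, 0.12), (80, 0.08), (320, 0.07)]
l = 2000

# SRM keeps the machine with the smallest risk bound, not the smallest R_emp.
best_h, best_r_emp = min(candidates, key=lambda c: risk_bound(c[1], c[0], l))
print(best_h, risk_bound(best_r_emp, best_h, l))
```

In this toy example the richest set has the lowest empirical risk, yet its large VC confidence makes its bound the worst, which is the tradeoff that SRM is designed to resolve.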

3.4  Construction  of  Support  Vector  Machines 

This section describes how to construct a Support Vector Machine (SVM) [65] step by step,
from the simplest case of linearly separable patterns to linearly non-separable patterns, and
finally to nonlinear decision surfaces.

3.4.1  Optimal  Hyperplane  for  the  Linearly  Separable  Case 

First, in the linearly separable case, one wishes to find the best hyperplane that separates the
data. Here, "linearly separable" means that one can find a pair $(\mathbf{w}, b)$ such that

\[ \mathbf{w}^T \mathbf{x}_i + b \ge 1 \quad \forall\, \mathbf{x}_i \in \text{Class 1} \tag{3.3} \]
\[ \mathbf{w}^T \mathbf{x}_i + b \le -1 \quad \forall\, \mathbf{x}_i \in \text{Class 2} \tag{3.4} \]

In this case, the hypothesis space is the set of functions

\[ f(\mathbf{x}; \mathbf{w}, b) = \mathrm{sign}(\mathbf{w}^T \mathbf{x} + b) \tag{3.5} \]

To make a decision surface correspond to a unique parameter pair $(\mathbf{w}, b)$, the following constraint
is imposed:

\[ \min_{i = 1, \ldots, l} |\mathbf{w}^T \mathbf{x}_i + b| = 1 \tag{3.6} \]

where $\mathbf{x}_1, \ldots, \mathbf{x}_l$ are points in the data set. The hyperplanes that satisfy (3.6) are called
canonical hyperplanes. Notice that (3.6) is just a normalization. As mentioned in Section 3.2.2,
the VC dimension of the canonical hyperplanes in $\mathbb{R}^N$ is $N + 1$, which is the total number of
free parameters in (3.5). To implement the SRM Principle, a structure on the set of canonical
hyperplanes is produced by adding another constraint as follows:

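Let $D$ denote the diameter of the smallest N-dimensional sphere containing all the points
$\mathbf{x}_1, \ldots, \mathbf{x}_l$. Then the set

\[ \{ f(\mathbf{x}; \mathbf{w}, b) = \mathrm{sign}(\mathbf{w}^T \mathbf{x} + b) \;:\; \|\mathbf{w}\| \le A \} \tag{3.7} \]

has VC dimension $h$ that satisfies the following bound [65]:

\[ h \le \min\{ \lceil D^2 A^2 / 4 \rceil, N \} + 1 \tag{3.8} \]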


Moreover, it can be shown that the distance from a point $\mathbf{x}$ to the hyperplane defined by the
pair $(\mathbf{w}, b)$ is

\[ d(\mathbf{x}; \mathbf{w}, b) = \frac{|\mathbf{w}^T \mathbf{x} + b|}{\|\mathbf{w}\|} \tag{3.9} \]


Substituting (3.6) into (3.9), it follows that the distance between the canonical hyperplane and
the closest data point is $1/\|\mathbf{w}\|$. Thus if $\|\mathbf{w}\| \le A$, the distance between the canonical hyperplane
and the closest data point has to be at least $1/A$. This means that the constrained set of
canonical hyperplanes of (3.7) is the set of hyperplanes whose distance from the data points is
at least $1/A$. Clearly, after the normalization, the distance between the two classes is $2/\|\mathbf{w}\|$. This
distance is called the margin of separation.
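
The normalization and the resulting margin can be checked numerically. The following Python sketch uses a made-up set of four points and a separating direction chosen by hand; the helper names are illustrative. It rescales (w, b) so that the closest point satisfies |w^T x + b| = 1 and then evaluates the point-to-hyperplane distance and the margin 2/||w||.

```python
import numpy as np

def distance_to_hyperplane(x, w, b):
    """Distance from point x to the hyperplane w^T x + b = 0."""
    return abs(w @ x + b) / np.linalg.norm(w)

def canonical_form(w, b, X):
    """Rescale (w, b) so that min_i |w^T x_i + b| = 1 over the data set X."""
    scale = np.min(np.abs(X @ w + b))
    return w / scale, b / scale

# Toy data: two points on each side of the vertical axis.
X = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-4.0, 1.0]])

# Any positive multiple of (1, 0) describes the same hyperplane x_1 = 0.
w, b = canonical_form(np.array([5.0, 0.0]), 0.0, X)

closest = min(distance_to_hyperplane(x, w, b) for x in X)
margin = 2.0 / np.linalg.norm(w)
print(w, b, closest, margin)
```

Here the closest points lie at distance 2 from the hyperplane, the canonical weight vector has norm 1/2, and the margin of separation is 4.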

According to the bound (3.8), minimizing $\|\mathbf{w}\|$ will make the VC dimension small. So
among the canonical hyperplanes that correctly classify the data, the one with the smallest
$\|\mathbf{w}\|$ minimizes the risk bound (3.2). Formally, finding the optimal decision plane is equivalent
to the following quadratic programming (QP) problem:

\[ \min_{\mathbf{w}, b} \; \Phi(\mathbf{w}) = \frac{1}{2} \|\mathbf{w}\|^2 \quad \text{subject to} \quad y_i(\mathbf{w}^T \mathbf{x}_i + b) \ge 1, \quad i = 1, \ldots, l \tag{3.10} \]


This constrained optimization problem is called the primal problem. Here the cost function
$\Phi(\mathbf{w})$ is a convex function of $\mathbf{w}$ and the constraints are linear in $\mathbf{w}$.

This problem can be solved by the technique of Lagrange multipliers. The Lagrangian
function is constructed as follows:

\[ L(\mathbf{w}, b, \Lambda) = \frac{1}{2} \|\mathbf{w}\|^2 - \sum_{i=1}^{l} \lambda_i \left[ y_i(\mathbf{w}^T \mathbf{x}_i + b) - 1 \right] \tag{3.11} \]

where $\Lambda = (\lambda_1, \ldots, \lambda_l)$ is the vector of non-negative Lagrange multipliers (notice that here the
definition of $\Lambda$ is different from that in (3.1)). The solution to this optimization problem is
determined by the saddle point of $L(\mathbf{w}, b, \Lambda)$, which has to be minimized with respect to $\mathbf{w}$ and
$b$, and maximized with respect to $\Lambda \ge 0$. By differentiating $L(\mathbf{w}, b, \Lambda)$ with respect to $\mathbf{w}$ and
$b$ and setting the derivatives to zero, it follows that

\[ \mathbf{w} = \sum_{i=1}^{l} \lambda_i y_i \mathbf{x}_i \tag{3.12} \]
\[ \sum_{i=1}^{l} \lambda_i y_i = 0 \tag{3.13} \]

The solution vector $\mathbf{w}$ is defined in terms of a linear combination of training vectors. From the
training procedure the optimal $\mathbf{w}^*$ can be explicitly and uniquely determined by virtue of the
convexity of the Lagrangian. To determine the optimal $b^*$, however, one needs to resort to the
Karush-Kuhn-Tucker (KKT) "complementary" condition [27]:

\[ \lambda_i^* \left[ y_i(\mathbf{w}^{*T} \mathbf{x}_i + b^*) - 1 \right] = 0 \quad \text{for } i = 1, 2, \ldots, l \tag{3.14} \]

Only those Lagrange multipliers whose constraints are active, i.e., for which
$y_i(\mathbf{w}^{*T} \mathbf{x}_i + b^*) = 1$ in (3.14), can assume nonzero values. The data
points $(\mathbf{x}_i, y_i)$ for which the corresponding $\lambda_i^* > 0$ are called support vectors. From a geometric
perspective, the support vectors are those data points that lie closest to the decision surface.
From (3.14) it follows that the optimal $b^*$ can be computed as

\[ b^* = y_i - \mathbf{w}^{*T} \mathbf{x}_i \]

for any support vector $(\mathbf{x}_i, y_i)$. In practice, it is numerically safer to take the mean of all such values of $b^*$.

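From (3.12) and (3.13), the original primal problem can be reformulated into its dual problem:

\[ \begin{aligned}
\text{Maximize} \quad & Q(\Lambda) = \sum_{i=1}^{l} \lambda_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \lambda_i \lambda_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j \\
\text{subject to} \quad & \sum_{i=1}^{l} \lambda_i y_i = 0 \\
& \lambda_i \ge 0 \quad \text{for } i = 1, 2, \ldots, l
\end{aligned} \tag{3.15} \]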
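
The step from the Lagrangian (3.11) to the dual objective in (3.15) can be written out explicitly. The expansion below is a worked intermediate step that uses only the stationarity conditions (3.12) and (3.13); it introduces no symbols beyond those already defined.

```latex
\begin{aligned}
L(\mathbf{w}, b, \Lambda)
  &= \tfrac{1}{2}\mathbf{w}^T\mathbf{w}
     - \sum_{i=1}^{l} \lambda_i y_i \mathbf{w}^T\mathbf{x}_i
     - b \sum_{i=1}^{l} \lambda_i y_i
     + \sum_{i=1}^{l} \lambda_i \\
  % (3.13) removes the term in b; (3.12) gives \sum_i \lambda_i y_i \mathbf{w}^T\mathbf{x}_i = \mathbf{w}^T\mathbf{w}
  &= \tfrac{1}{2}\mathbf{w}^T\mathbf{w} - \mathbf{w}^T\mathbf{w} + \sum_{i=1}^{l} \lambda_i \\
  % substitute (3.12) once more for both factors of \mathbf{w}^T\mathbf{w}
  &= \sum_{i=1}^{l} \lambda_i
     - \tfrac{1}{2} \sum_{i=1}^{l}\sum_{j=1}^{l}
       \lambda_i \lambda_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j
   = Q(\Lambda)
\end{aligned}
```

The primal variables w and b no longer appear, so the dual problem depends on the training data only through the inner products between pairs of training vectors.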

Also, we can reformulate the decision function (3.5) as

\[ f(\mathbf{x}) = \mathrm{sign}\Big( \sum_{i} y_i \lambda_i^* \mathbf{x}^T \mathbf{x}_i + b^* \Big) \tag{3.16} \]

where $(\mathbf{x}_i, y_i)$ are the support vectors.


3.4.2  Soft  Margin  Hyperplane  for  the  Linearly  Non-Separable  Case 

In the linearly non-separable case, there exists at least one data point $(\mathbf{x}_i, y_i)$ that violates
the constraint:

\[ y_i(\mathbf{w}^T \mathbf{x}_i + b) \ge 1, \quad i = 1, 2, \ldots, l \]

Accordingly, the margin of separation is said to be soft. To deal with the non-separable
case, one needs a new set of nonnegative scalar variables $\{\xi_i\}_{i=1}^{l}$, defined as follows:

\[ y_i(\mathbf{w}^T \mathbf{x}_i + b) \ge 1 - \xi_i, \quad i = 1, 2, \ldots, l \tag{3.17} \]

The variables $\xi_i$ measure the oriented distance of a data point to the decision hyperplane. When
$\xi_i > 1$, the data point falls on the wrong side of the decision hyperplane. In this case, the
support vectors are the data points that satisfy (3.17) with equality even if $\xi_i > 0$.

The following function measures the total number of misclassifications:

\[ \Phi(\xi) = \sum_{i=1}^{l} I(\xi_i - 1) \]

where the indicator function $I(\xi)$ is defined by

\[ I(\xi) = \begin{cases} 0 & \text{if } \xi \le 0 \\ 1 & \text{if } \xi > 0 \end{cases} \]

Unfortunately, using the indicator function in $\Phi(\xi)$ results in a nonconvex optimization problem
that is NP-complete. To make the optimization problem tractable, $\Phi(\xi)$ is approximated by

\[ \Phi(\xi) = \sum_{i=1}^{l} \xi_i \]

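Finally, in order to maximize the margin and minimize the number of misclassifications
simultaneously, SVM solves the following primal problem:

\[ \min_{\mathbf{w}, b, \xi} \; \Phi(\mathbf{w}, \xi) = \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^{l} \xi_i \tag{3.18} \]
\[ \text{subject to} \quad y_i(\mathbf{w}^T \mathbf{x}_i + b) \ge 1 - \xi_i, \quad i = 1, \ldots, l \tag{3.19} \]
\[ \xi_i \ge 0, \quad i = 1, \ldots, l \tag{3.20} \]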

Minimizing the first term in (3.18) leads to minimizing the VC dimension of the learning
machine, and minimizing the second term controls the empirical risk. Therefore, this approach
constitutes an approximate implementation of the SRM principle. Here the parameter C
controls the tradeoff between the complexity and the empirical risk of the trained machine.
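
The role of the slack variables and of C can be illustrated by evaluating the primal objective for a fixed hyperplane. In the Python sketch below, the data, the choice of (w, b), and the value of C are all made up; for a fixed (w, b), the smallest slacks compatible with (3.19) and (3.20) are max(0, 1 - y_i(w^T x_i + b)), which is what the code computes.

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """Value of (1/2)||w||^2 + C * sum(xi) with the smallest feasible slacks xi."""
    # For fixed (w, b), the constraints xi_i >= 1 - y_i (w^T x_i + b) and
    # xi_i >= 0 are both tight at the element-wise maximum below.
    xi = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * (w @ w) + C * xi.sum(), xi

# Toy data on a line: the last point carries the label -1 but sits on the
# positive side, so the set is not linearly separable.
X = np.array([[2.0, 0.0], [0.5, 0.0], [-2.0, 0.0], [1.5, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

value, xi = soft_margin_objective(np.array([1.0, 0.0]), 0.0, X, y, C=10.0)
print(xi, value)
```

The second point has a slack between 0 and 1 (inside the margin but correctly classified), while the last point has a slack larger than 1 (misclassified). A larger C penalizes both more heavily relative to the margin term.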

As in the previous section, the dual problem can be formulated as

\[ \begin{aligned}
\text{Maximize} \quad & Q(\Lambda) = \sum_{i=1}^{l} \lambda_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \lambda_i \lambda_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j \\
\text{subject to} \quad & \sum_{i=1}^{l} \lambda_i y_i = 0 \\
& 0 \le \lambda_i \le C \quad \text{for } i = 1, 2, \ldots, l
\end{aligned} \tag{3.21} \]

The dual problem for the case of non-separable patterns differs from that for the simple case
of linearly separable patterns in a minor but important way: the constraint $\lambda_i \ge 0$ is replaced
by the more stringent constraint $0 \le \lambda_i \le C$. Except for this, the optimization for the
non-separable case and the computation of the optimal $\mathbf{w}^*$ are done in the same way as in the
linearly separable case.

In addition, the optimal bias value is computed in a way similar to that described before.
Specifically, from the KKT conditions it follows that

\[ \lambda_i^* \left[ y_i(\mathbf{w}^{*T} \mathbf{x}_i + b^*) - 1 + \xi_i \right] = 0 \quad \text{for } i = 1, 2, \ldots, l \tag{3.22} \]

To make all the slack variables $\xi_i$ nonnegative, it can be derived from the Lagrange technique
that

\[ \mu_i \xi_i = 0, \quad i = 1, 2, \ldots, l \tag{3.23} \]

where the $\mu_i$ are the Lagrange multipliers associated with the constraints (3.20). Setting the
derivative of the Lagrangian function for the primal problem with respect to the variable $\xi_i$
to zero leads to

\[ \lambda_i + \mu_i = C \tag{3.24} \]

From (3.23) and (3.24), it follows that

\[ \xi_i = 0 \quad \text{if } \lambda_i < C \tag{3.25} \]

Hence the optimal bias $b^*$ can be computed by taking any data point $(\mathbf{x}_i, y_i)$ in the training
set for which we have $0 < \lambda_i < C$ and therefore $\xi_i = 0$, and using that data point in (3.22). As
mentioned before, it is numerically safer to take the mean value of $b^*$ resulting from all such
training data.
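
Once the multipliers are known, the hyperplane follows from them directly. The Python sketch below assumes the multipliers have already been obtained by some QP solver (none is implemented here). It recovers w* from the expansion (3.12) and averages b* over the points with 0 < lambda_i < C. The three-point data set and its multipliers are a hand-checked toy example, and the tolerance `tol` is an illustrative choice.

```python
import numpy as np

def recover_hyperplane(lam, X, y, C, tol=1e-8):
    """Recover (w*, b*) from dual multipliers lam for training data (X, y)."""
    # w* is a linear combination of the training vectors, as in (3.12).
    w = (lam * y) @ X
    # Points with 0 < lam_i < C have zero slack and lie exactly on the margin,
    # so each of them gives b* = y_i - w*^T x_i; average for numerical safety.
    on_margin = (lam > tol) & (lam < C - tol)
    b = np.mean(y[on_margin] - X[on_margin] @ w)
    return w, b

# Toy problem whose optimum can be verified by hand: the first two points are
# the support vectors, the third lies well outside the margin.
X = np.array([[1.0, 0.0], [-1.0, 0.0], [3.0, 2.0]])
y = np.array([1.0, -1.0, 1.0])
lam = np.array([0.5, 0.5, 0.0])

w, b = recover_hyperplane(lam, X, y, C=10.0)
print(w, b, np.sign(X @ w + b))
```

For these values the multipliers satisfy the equality constraint of the dual problem, and the recovered hyperplane is the vertical axis with a margin of 2.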

3.4.3 Nonlinear Decision Surfaces


This section extends the linear optimal hyperplane to more complicated decision surfaces for
the real-world pattern recognition problem. The extension involves two operations:

• First, nonlinearly map an input variable x into a high-dimensional feature space.

• Then, construct an optimal hyperplane in the high-dimensional feature space.

The first operation is justified by Cover's theorem on the separability of patterns, which may
be stated as follows [17]:

"A complex pattern-classification problem cast in a high-dimensional space nonlinearly
is more likely to be linearly separable than in a low-dimensional space."

In the second operation, an optimal hyperplane is built in the same way as described in the
previous sections, except that the support vectors are not drawn from the input space, but from
the high-dimensional feature space.

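Let $\{\varphi_j(\mathbf{x})\}_{j=1}^{M}$ denote a set of nonlinear transformations from the input space to the feature
space, where $M$ is the dimension of the feature space. The nonlinear mapping is defined as

\[ \mathbf{x} \mapsto \varphi(\mathbf{x}) = (\varphi_1(\mathbf{x}), \varphi_2(\mathbf{x}), \ldots, \varphi_M(\mathbf{x})) \tag{3.26} \]

Then an SVM is constructed as follows:

\[ f(\mathbf{x}) = \mathrm{sign}(\mathbf{w}^{*T} \varphi(\mathbf{x}) + b^*) = \mathrm{sign}\Big( \sum_{i=1}^{l} y_i \lambda_i^* \varphi^T(\mathbf{x}) \varphi(\mathbf{x}_i) + b^* \Big) \tag{3.27} \]

Let $K$ denote the inner-product kernel, which is defined as

\[ K(\mathbf{x}, \mathbf{z}) = \varphi^T(\mathbf{x}) \varphi(\mathbf{z}) = \sum_{j=1}^{M} \varphi_j(\mathbf{x}) \varphi_j(\mathbf{z}) \tag{3.28} \]

Substituting (3.28) into (3.27) yields

\[ f(\mathbf{x}) = \mathrm{sign}\Big( \sum_{i=1}^{l} y_i \lambda_i^* K(\mathbf{x}, \mathbf{x}_i) + b^* \Big) \tag{3.29} \]

and the QP problem (3.21) becomes

\[ \begin{aligned}
\text{Maximize} \quad & Q(\Lambda) = \sum_{i=1}^{l} \lambda_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \lambda_i \lambda_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j) \\
\text{subject to} \quad & \sum_{i=1}^{l} \lambda_i y_i = 0 \\
& 0 \le \lambda_i \le C \quad \text{for } i = 1, 2, \ldots, l
\end{aligned} \tag{3.30} \]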

The use of the kernel trick greatly reduces the high computational burden of the nonlinear
mapping into the high-dimensional space in SVM.
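
A small numerical check makes the kernel trick concrete. The Python sketch below uses a degree-2 polynomial kernel on two-dimensional inputs; the explicit six-dimensional feature map `poly2_features` is one standard choice written out for illustration, and the two test vectors are arbitrary. The kernel value computed in the input space matches the inner product computed in the feature space.

```python
import numpy as np

def poly2_kernel(x, z):
    """Degree-2 polynomial kernel evaluated directly in the input space."""
    return (1.0 + x @ z) ** 2

def poly2_features(x):
    """Explicit feature map whose inner product reproduces poly2_kernel in 2-D."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1 * x1, x2 * x2, s * x1 * x2])

x = np.array([0.5, -1.0])
z = np.array([2.0, 3.0])

direct = poly2_kernel(x, z)                         # one inner product in R^2
explicit = poly2_features(x) @ poly2_features(z)    # inner product in R^6
print(direct, explicit)
```

The saving grows quickly with the input dimension and the polynomial degree, since the kernel never forms the feature vectors explicitly.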

Note that the expansion (3.28) is a special case of Mercer's theorem [44]. According to
Mercer's theorem, the functions $\varphi_j(\mathbf{x})$ are eigenfunctions of the expansion. They are positive
definite. In theory, the dimensionality of the feature space (i.e., the number of eigenfunctions)
can be infinitely large. Mercer's theorem provides a way to check whether a candidate function
is really an inner-product kernel in some space. Some commonly used inner-product kernels
are listed in Table 3.1.




Inner-Product Kernel                                Type of Classifier
$K(x, z) = \exp(-\|x - z\|^2)$                      Radial basis function network
$K(x, z) = (1 + x^T z)^d$                           Polynomial learning machine
$K(x, z) = \tanh(x^T z - \theta)$                   Multi-layer Perceptron
  (only for some values of $\theta$
  which satisfy Mercer's theorem)

Table 3.1: Summary of commonly used inner-product kernels
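
The restriction on the tanh kernel can be seen numerically. A function that satisfies Mercer's theorem yields a positive semidefinite Gram matrix on any finite set of points. The Python sketch below builds Gram matrices for the kernels of Table 3.1 on four arbitrary points and inspects the smallest eigenvalue. The points, the degree d = 2, and the offset theta = 5 are illustrative choices, with theta picked deliberately so that the tanh kernel fails the check.

```python
import numpy as np

def rbf_kernel(x, z):
    return np.exp(-np.sum((x - z) ** 2))

def poly_kernel(x, z, d=2):
    return (1.0 + x @ z) ** d

def tanh_kernel(x, z, theta):
    return np.tanh(x @ z - theta)

def gram_matrix(kernel, X):
    """Matrix of kernel values K(x_i, x_j) over all pairs of rows of X."""
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]])

# eigvalsh handles symmetric matrices and returns real eigenvalues.
min_eig_rbf = np.linalg.eigvalsh(gram_matrix(rbf_kernel, X)).min()
min_eig_poly = np.linalg.eigvalsh(gram_matrix(poly_kernel, X)).min()
min_eig_tanh = np.linalg.eigvalsh(
    gram_matrix(lambda x, z: tanh_kernel(x, z, theta=5.0), X)).min()
print(min_eig_rbf, min_eig_poly, min_eig_tanh)
```

With theta = 5 every diagonal entry of the tanh Gram matrix is negative, so the matrix cannot be positive semidefinite and this particular tanh function is not a valid inner-product kernel.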


3.5  Summary  and  Discussion 

In  this  section  we  first  summarize  some  important  properties  of  SVM: 

• As an approximate implementation of the SRM Principle, SVM provides a method of
minimizing the empirical risk and the VC dimension at the same time, so that the risk
bound of the trained machine can be minimized, i.e., the trained machine has good
generalization performance.

• By reformulating the primal optimization problem into its dual problem and using a
suitable inner-product kernel, SVM controls the model complexity independently of the
dimensionality of the feature space. Actually, an infinite feature space is allowed in SVM.

• Moreover, the convex cost function in the QP problem guarantees that SVM will find a
globally optimal solution, while many other learning algorithms suffer from falling into
local extrema.

•  By  solving  the  QP  problem  during  the  training  phase,  SVM  automatically  tunes  all  the 
parameters  in  a  learning  scheme. 

Though  originally  derived  from  the  SRM  Principle  to  address  the  problem  of  the  tradeoff 
between  model  complexity  and  generalization  ability,  SVM  is  closely  related  to  other  known 
techniques  and  research  problems: 

•  The  support  vectors  are  usually  sparse.  They  only  constitute  a  fraction  of  the  total  number 
of  examples  in  the  training  set.  Using  the  reproducing  property  of  the  Reproducing 
Kernel  Hilbert  Space  (RKHS),  Girosi  [29]  shows  an  equivalence  between  SVMs  in  the 
noiseless  case  and  a  Sparse  Approximation  (SA)  scheme  that  resembles  the  Basis  Pursuit 
De-Noising  algorithm  [14]. 

• Also in [29], Girosi gives a derivation of the SVM algorithm in the framework of
regularization theory. In [24], Evgeniou et al. give a unified framework for regularization
networks and SVM. The reformulation of SVM in regularization theory reveals the
connection between SVM and other known techniques. However, it hides the relation between
SVM and the SRM Principle.

• SVM provides high generalization performance without incorporating any prior knowledge
of the problem. An important research topic is how to incorporate problem-domain
knowledge into SVM to further improve its performance. Some proposed approaches
include adding an additional term that represents prior knowledge in the cost function,
using prior knowledge to design the kernel function [59], and adding virtual examples into
the training set [58]. More efficient and natural ways of adding prior knowledge into SVM
have yet to be developed. For example, integrating Bayesian learning theory into SVM
might be a good way of exploiting prior information.

• The kernel trick in SVM can also be used in other algorithms that are based on the
inner product of the data. For example, Principal Component Analysis can be done in
high-dimensional feature space by using a suitable nonlinear kernel function [60]. Fisher
discriminant analysis also uses a similar idea [45].


Chapter 4


A  Hybrid  ICA/SVM  Learning  Scheme  and  its  Application  to 
Face  Detection 


4.1  Introduction 

In this chapter, we propose a new hybrid unsupervised/supervised learning scheme that
integrates Independent Component Analysis with the Support Vector Machine (SVM), and we
apply this new learning scheme to the face detection problem. As a powerful unsupervised
learning algorithm, ICA can not only "blindly" separate mixed signals as shown in Chapter 2,
but also effectively extract low-level features in signals. In [9], Bell and Sejnowski point out
that the independent components of natural scenes are edge filters. And in [48], Olshausen and
Field show an equivalence between sparse coding and an ICA algorithm in the case of no noise
and a square system (i.e., the dimensionalities of output and input are the same). Using their
feature extraction capability, ICA algorithms have been successfully used in face recognition and
facial expression analysis, and have achieved better results than Principal Component Analysis
(PCA) [8, 21, 42]. On the other hand, SVM is a promising supervised learning algorithm. As
discussed in Chapter 3, by minimizing the empirical risk on the training data set and the model
complexity, measured by the VC dimension, at the same time, SVM gives good generalization
performance on pattern recognition problems. Combining these two learning algorithms yields
a powerful hybrid learning scheme. In this chapter we apply this new learning scheme to the
face detection problem. Specifically, we use ICA in two different ways to extract low-level
features from a window sliding over an image, and then apply SVM at a high level to classify
the extracted ICA features as face or not. In addition, to reduce the time of the detection
procedure, a skin-color filter is implemented to find the candidate face regions in an image, so
that the sliding window moves over reduced image regions. Experimental results demonstrate
the effectiveness of the new hybrid learning scheme on the face detection problem.

The rest of this chapter is organized as follows. Section 4.2 gives a short review of face
detection methods. Section 4.3 presents the method of finding candidate face regions in images
using a skin-color filter. Section 4.4 presents the hybrid ICA/SVM learning scheme. Section 4.5
describes the face detection system based on this scheme. Section 4.6 reports our experimental
results. Finally, Section 4.7 contains conclusions and discussion.

4.2  Literature  Review  on  Face  Detection 

Face detection has important applications in various areas, such as intelligent human-computer
interaction, video surveillance, video indexing, and object-based video coding. These
applications have contributed to an increasing research interest in face detection in recent years. In
this section, we give a short review of the technical literature on face detection.

In [62] Sung and Poggio propose an example-based learning approach to detecting frontal
human faces. They use six Gaussian clusters to model the distributions of face patterns and
six other Gaussian clusters for non-face patterns, and use two distance metrics to train a
Multilayer Perceptron as the classifier. Rowley et al., in [54] and [55], use a retinally connected
neural network to detect faces in an image. Multiple networks are used to improve system
performance. In [49] Osuna et al. apply a support vector machine to face detection and obtain
slightly better results than Sung and Poggio on two test sets. In [50] Qian and Huang report
a detection scheme that combines template matching and a feature-based detection algorithm
using hierarchical Markov random fields (MRF) and maximum a posteriori probability (MAP)
estimation. In [56] Samaria uses Hidden Markov Models (HMM) for face detection. In [16]
Colmenarez and Huang use Kullback divergence to maximize the discrimination between
positive and negative examples of faces. A family of discrete Markov processes is used to model
faces and background patterns. Detection is based on the likelihood ratio computed during the
training phase. In [46] Moghaddam and Pentland propose a detection method that is based
on density estimation in a high-dimensional space using an eigenspace decomposition. In [51]
Rajagopalan et al. apply higher-order statistics and HMMs to detect faces. In [53], Roth et
al. present a face detection method that uses a Sparse Network of Winnows (SNoW) learning
architecture, which has been successfully used in the natural language domain.

In addition to these statistical methods, in [69] Yuille uses deformable templates to model
facial features. In this approach, facial features are described by parameterized templates. The
best fit of the elastic model is obtained by minimizing an energy function. Texture information
has also been used for detecting faces [18].

To speed up the detection procedure, color and motion information can be exploited in
color images and video sequences [66]. A single Gaussian or a mixture of Gaussians can
be used to model the skin color distribution. Expectation-Maximization (EM), an iterative
maximum-likelihood estimation algorithm, provides an effective way of learning a Gaussian
mixture model [52]. More recently, several modular systems combining shape analysis, color
segmentation and motion information have been used for locating and tracking faces in a video
sequence [30].

In [67] Yang gives an extensive survey of face detection methods. In [13], Chellappa et al.
give a comprehensive survey of the literature on human and machine recognition of faces, which
is closely related to face detection.

After this short review of face detection, we are ready to present our face detection system
based on the hybrid ICA/SVM learning scheme. In a preliminary section, we first introduce the
preprocessing procedure used in our detection system, which includes two main components, a
skin color filter and histogram equalization.

4.3  Skin  Color  Filter  and  Other  Preprocessing 

Though skin color can be nicely modeled as a mixture of Gaussians as mentioned before, our
system uses a simpler method to find possible skin regions in an image because we only use it
to reduce the search area in the image instead of finding exact face locations. This method is
a modified version of the skin filter proposed in [26] and, in essence, is a thresholding approach
in hue and saturation space. It includes the following steps:

•  First, the input color image in RGB format is transformed to log-opponent (I, R_g, B_y) values,
and from these values the amplitude, hue, and saturation are computed. The conversion
from RGB to log-opponent is computed as follows:

$$I = \frac{L(R) + L(B) + L(G)}{3} \tag{4.1}$$

$$R_g = L(R) - L(G) \tag{4.2}$$

$$B_y = L(B) - \frac{L(G) + L(R)}{2} \tag{4.3}$$

where $L(x) = 105 \log_{10}(x + 1)$.

•  Second, the log-opponent values are transformed into hue-saturation space as follows:

$$H = \arctan(R_g / B_y) \tag{4.4}$$

$$S = \sqrt{R_g^2 + B_y^2} \tag{4.5}$$

where H and S represent the hue and saturation images respectively, and the unit for H
is degrees.

Figures 4.1, 4.2, and 4.3 show a color image, its hue image, and its saturation image
respectively. Note that there is a strong blocking effect in the hue image (Figure 4.2).
The reason is that, in image coding, many fewer bits are assigned to the color information
than to the gray intensity information. This suggests that using only color information
to locate faces in images is not robust to image coding error.


Figure 4.1: A test image for face detection


•  Next, by a simple thresholding method, we produce a binary mask that locates face
candidate regions in an image. The thresholding method is defined as follows:

$$M_{x,y} = \begin{cases} 1 & \text{if } 120 < H_{x,y} < 175 \text{ and } 15 < S_{x,y} < 75 \\ 0 & \text{otherwise} \end{cases}$$

where $M_{x,y}$, $H_{x,y}$, and $S_{x,y}$ are the values of the binary face candidate mask, hue image, and
saturation image at pixel (x, y) respectively.

Figure 4.4 shows the binary face candidate mask for the image in Figure 4.1.

In this example the skin color filter appears to work very well, as shown in Figure 4.4.
However, the skin color filter does not always work well. It tends to falsely detect highly
saturated red and yellow areas as face candidate areas. The reason for the problem may
be an improper threshold or the low-rate image coding. Clearly, this simple approach is
inadequate for finding accurate face locations in an image. However, it is acceptable for
our system because we only use it as part of the preprocessing to reduce the image area
scanned by a sliding window.

Figure 4.2: Test image in hue space

Figure 4.3: Test image in saturation space

Figure 4.4: Possible face areas for test image

•  After the binary face candidate mask is produced, simple morphological operations are
performed on the mask, which include binary closure (i.e., dilation followed by erosion)
and removal of small blobs, because small blobs usually arise from non-face regions.

The  binary  mask  after  the  morphological  operations  is  shown  in  Figure  4.5. 


Figure  4.5:  Final  face  candidate  mask  for  the  test  image 
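As an illustration, the filter steps above can be sketched per pixel in plain Python. This is a minimal sketch and not the implementation used in our system. Several choices are assumptions made only for the example: B_y is taken as L(B) - (L(G) + L(R))/2, the hue uses the four-quadrant arctangent (so that values between 120 and 175 degrees are reachable), the hue and saturation thresholds are applied jointly, and the closure uses a 3 × 3 neighborhood. The function names are hypothetical, and the removal of small blobs is omitted.

```python
import math

def L(x):
    """Log transform of the filter: L(x) = 105 * log10(x + 1)."""
    return 105.0 * math.log10(x + 1.0)

def hue_saturation(r, g, b):
    """Return (hue in degrees, saturation) for one RGB pixel with values 0..255."""
    rg = L(r) - L(g)
    # Assumption: B_y = L(B) - (L(G) + L(R)) / 2.
    by = L(b) - (L(g) + L(r)) / 2.0
    # Assumption: four-quadrant arctangent, so hue can exceed 90 degrees.
    hue = math.degrees(math.atan2(rg, by))
    sat = math.hypot(rg, by)
    return hue, sat

def skin_mask(image):
    """image: rows of (r, g, b) tuples -> rows of 0/1 face-candidate values."""
    mask = []
    for row in image:
        out = []
        for r, g, b in row:
            h, s = hue_saturation(r, g, b)
            out.append(1 if (120 < h < 175 and 15 < s < 75) else 0)
        mask.append(out)
    return mask

def neighborhood(mask, i, j):
    """Values of the 3x3 neighborhood of (i, j) that fall inside the mask."""
    rows, cols = len(mask), len(mask[0])
    return [mask[a][b]
            for a in range(max(0, i - 1), min(rows, i + 2))
            for b in range(max(0, j - 1), min(cols, j + 2))]

def close_mask(mask):
    """Binary closure: 3x3 dilation followed by 3x3 erosion."""
    rows, cols = len(mask), len(mask[0])
    dilated = [[int(any(neighborhood(mask, i, j))) for j in range(cols)]
               for i in range(rows)]
    return [[int(all(neighborhood(dilated, i, j))) for j in range(cols)]
            for i in range(rows)]
```

For instance, `skin_mask([[(200, 140, 110), (128, 128, 128)]])` marks the first (skin-like) pixel and rejects the gray one, whose saturation is zero, and `close_mask` fills a one-pixel hole inside a candidate region.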


Using the mask, we can find candidate face areas in an image, and the sliding window can
move only over the candidate areas instead of the whole image. In order to compensate for
different illuminations, camera responses, etc., histogram equalization is performed over the
image blocks defined by the sliding window.
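A minimal sketch of histogram equalization on one image block is given below. It assumes 8-bit gray values stored in a flat, non-empty list and uses the common cumulative-histogram mapping; the exact variant used in our system is not specified here, so the function name and the rounding rule are illustrative assumptions.

```python
def equalize_block(block, levels=256):
    """Histogram-equalize a flat list of integer gray values in [0, levels)."""
    hist = [0] * levels
    for v in block:
        hist[v] += 1
    # Cumulative histogram: cdf[v] = number of pixels with value <= v.
    cdf, total = [], 0
    for count in hist:
        total += count
        cdf.append(total)
    cdf_min = next(c for c in cdf if c > 0)
    n = len(block)
    if n == cdf_min:
        # Constant block: there is no contrast to stretch.
        return list(block)
    scale = (levels - 1) / (n - cdf_min)
    # Map each value through the normalized cumulative histogram.
    return [int((cdf[v] - cdf_min) * scale + 0.5) for v in block]
```

A dark, low-contrast block such as `[10, 20, 30, 40]` is stretched to the full range `[0, 85, 170, 255]`, which is the sense in which equalization compensates for illumination differences between blocks.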

4.4  Hybrid  ICA/SVM  Learning  Scheme 

In this section we present a new hybrid learning scheme that integrates ICA and SVM. By
exploiting higher-order statistics, ICA can find an independent basis for the data, and obtain
a better clustering and representation of the data than PCA. When applied to natural images,
ICA filters are edge filters. When training on a large number of natural image blocks, we can
even get a wavelet-like ICA basis. But in contrast to wavelet analysis, the ICA basis is adaptive
to the training data.

In the hybrid learning scheme, after ICA feature extraction, SVM is applied to classify the
features. One common characteristic of ICA and SVM is sparsity. The ICA output is sparse,
and the support vectors in SVM are also sparse. In the following section, we describe this
hierarchical sparse learning architecture.

4.4.1  ICA  Feature  Extraction 

Redundancy  Reduction,  ICA,  and  Sparse  Coding 

As shown in [5, 6, 7, 20], an important characteristic of sensory processing in the brain is
redundancy reduction. One method of achieving redundancy reduction is based on the minimization
of mutual information of the system outputs. According to the theory developed in
Chapter 2, we know that ICA is such an algorithm. Actually, experimental results have shown that
trained ICA bases are very similar to the receptive fields of simple cells in mammalian visual
cortex [9, 47, 48]. A detailed comparison between ICA features and the properties
of simple cells in the macaque primary visual cortex is reported in [63], which finds good matches
to most of the parameters, especially if video sequences are used instead of still images.

Another method of reducing redundancy is sparse coding [7, 25, 47], which adds to the cost
function a term that represents the sparseness of the output. If the data has a super-Gaussian
distribution, sparse coding results in approximate redundancy reduction. These two approaches
are equivalent to each other in some cases, as shown in [48].

Connection  between  Projection  Pursuit  and  ICA  Feature  Extraction 

In addition to its close relation to sparse coding, ICA feature extraction is related to projection
pursuit [32]. Projection pursuit tries to find "interesting" projections for multidimensional
data. In [32] Huber argues that the most interesting directions are those that show the least
Gaussian distributions. At the same time, ICA can be interpreted as a search for a projection
such that the unmixed signals have maximal non-Gaussianity [35]. The use of the same criterion
in projection pursuit and ICA reveals the connection between these techniques.
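To make the notion of non-Gaussianity concrete, the sketch below computes the excess kurtosis of a set of samples. Excess kurtosis is one common measure of non-Gaussianity (it is zero for a Gaussian distribution); its use here is an assumption for illustration, since the criterion above does not fix a particular measure. The two sample sets are made up.

```python
def excess_kurtosis(samples):
    """Fourth standardized moment minus 3; zero for a Gaussian distribution."""
    n = len(samples)
    mean = sum(samples) / n
    m2 = sum((s - mean) ** 2 for s in samples) / n
    m4 = sum((s - mean) ** 4 for s in samples) / n
    return m4 / (m2 * m2) - 3.0

# Evenly spread values in [-1, 1]: flatter than a Gaussian (sub-Gaussian).
uniform_like = [i / 50.0 - 1.0 for i in range(101)]

# Mostly zeros with two large values: peaky and heavy-tailed (super-Gaussian, sparse).
sparse_like = [0.0] * 98 + [5.0, -5.0]

print(excess_kurtosis(uniform_like))  # negative
print(excess_kurtosis(sparse_like))   # large and positive
```

The sparse sample set has a large positive excess kurtosis, which matches the observation that sparse, super-Gaussian outputs are far from Gaussian and are therefore "interesting" in the projection pursuit sense.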

Two  Methods  of  ICA  Feature  Extraction 

Given the ICA generative model under the noise-free assumption

$$x = As$$

and the ICA reconstruction model

$$y = Wx,$$

there are two different methods of extracting features using ICA. These two methods were
proposed by Bartlett et al. [8] for face recognition and were shown to provide equally good
recognition performance. The first method is to find the statistically independent basis images
by ICA and then represent the image by the coefficients of the projection on those basis images
[8]. The second method is to first reduce the dimensionality of the image, and then represent
the image by the independent coefficients that are obtained by applying ICA [8, 42]. These two
methods are explained in detail in what follows.




Method  1:  Independent  Image  Basis 

In Method 1, each row component in the mixture x represents a training image and each row
component in y represents an independent image basis. Note that here an image is represented
by a vector that is the concatenation of all the columns in the image. An ICA representation
of an image is the vector of coefficients of the projection on the independent bases in y.

Because the dimensionality of y is the same as that of x, it becomes necessary to control
the number of independent bases when the size of the training set is very large. Since we
assume the images in x are linear combinations of unknown independent sources in the ICA
generative model, we do not lose information by replacing the original images with their m
linear combinations.

If we view all the pixels in an image as an observation of a random vector, then the images are
linear combinations of the eigenvectors of their covariance matrix. We choose those eigenvectors
that correspond to the m maximal eigenvalues as the ICA training images, because they contain
most of the energy of the original image set. Denote these eigenvectors by $p_i$, where $i = 1, \ldots, m$.
Then $p_i$ is an N × 1 column vector, where N is the number of pixels in the image. Denote the matrix
that contains the m column vectors $p_i$ by $P_m$. By performing ICA on $P_m^T$ we can obtain a
matrix of m independent source images. Note that here each image is a row in x. Formally, we have
the following steps:

First, from the matrix x we find the eigenvector matrix $P_m$. Then we take $P_m^T$ as the mixture
x and apply the ICA algorithm as follows:

$$y = W P_m^T \;\Rightarrow\; P_m^T = W^{-1} y \tag{4.6}$$


where  each  row  of  y  represents  an  independent  image  basis. 

Finally, an ICA representation of an image is obtained as follows. The set of images in x
can be represented by their coordinates in the basis of eigenvectors, $R_m = x P_m$. A minimum
squared error approximation of x is obtained by

$$x_{rec} = R_m P_m^T = x P_m P_m^T. \tag{4.7}$$

Substituting (4.6) into (4.7), we get

$$x_{rec} = R_m W^{-1} y = x P_m W^{-1} y. \tag{4.8}$$

The rows of $x P_m W^{-1}$ are the coefficients for the linear combination of independent bases in y.
Thus for a test image, which is a row vector $I_{1 \times N}$, the ICA representation is

$$c = I P_m W^{-1} \tag{4.9}$$

where $P_m W^{-1}$ is obtained during the ICA training procedure.
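The relations (4.6) to (4.9) can be checked numerically on a toy example. The sketch below is only a consistency check under made-up values: the sizes (N = 3 pixels, m = 2 eigenvectors), the matrix P_m with orthonormal columns, and the unmixing matrix W are all hypothetical, and no ICA training is performed. It verifies that the coefficients c, combined with the independent bases y, reproduce the projection of the test image onto the span of the retained eigenvectors.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]

def inv2(w):
    """Inverse of a 2 x 2 matrix."""
    (a, b), (c, d) = w
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# Hypothetical toy sizes: N = 3 pixels, m = 2 retained eigenvectors.
P_m = [[1.0, 0.0],
       [0.0, 1.0],
       [0.0, 0.0]]                 # N x m, orthonormal columns
W = [[2.0, 1.0],
     [1.0, 1.0]]                   # m x m unmixing matrix (made up, invertible)

y = matmul(W, transpose(P_m))      # (4.6): rows of y are the independent bases
test_image = [[3.0, 4.0, 5.0]]     # 1 x N row vector I

c = matmul(matmul(test_image, P_m), inv2(W))   # (4.9): c = I P_m W^{-1}

reconstruction = matmul(c, y)                                  # c y
projection = matmul(matmul(test_image, P_m), transpose(P_m))   # I P_m P_m^T

print(c)               # coefficients on the independent bases
print(reconstruction)  # equals the projection, as in (4.7) and (4.8)
```

Because W^{-1} W cancels, c y = I P_m P_m^T, so the ICA representation loses exactly the information that the projection onto the m eigenvectors loses, and no more.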

Figures 4.6 and 4.7 show two sets of learned independent bases of two different
dimensionalities.

Method  2:  Independent  Projection  Coefficients 

In Method 2, we view an image as an observation of a random vector, and reduce its
dimension from N, the total number of pixels, to m. Let $x_m$ denote the matrix containing the
dimension-reduced images in its columns. Let $\{p_i\}_{i=1}^{m}$ again denote the eigenvectors that
correspond to the m maximal eigenvalues of the covariance matrix of x, and let $P_m$ denote the
matrix whose columns are $\{p_i\}_{i=1}^{m}$. Then we have

$$x_m = P_m^T x \tag{4.10}$$




Figure 4.6: Independent basis images obtained from 230 eigenvectors

Figure 4.7: Independent basis images obtained from 45 eigenvectors

We apply the ICA algorithm to $x_m$ and obtain

$$y = W x_m = W P_m^T x \tag{4.11}$$

So for a test image which is represented by a column vector $I_{N \times 1}$, its ICA representation,
c, is given by

$$c = W P_m^T I \tag{4.12}$$

where the matrix product $W P_m^T$ is obtained in the training procedure. Denote the product
$W P_m^T$ by U. The columns of U are the basis images for the ICA representation in Method 2.
Note that here every component in the representation vector c is independent of every other
component, while in Method 1, each basis image is independent of every other basis image.
Details are provided in Bartlett et al. [8].
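A minimal sketch of Method 2 feature extraction at test time follows. The matrices are made-up toy values (N = 3 pixels, m = 2), not trained quantities, and the function names are hypothetical. The point is only that the product U = W P_m^T can be computed once after training, so that extracting the features of a test image is a single matrix-vector product, as in (4.12).

```python
def matvec(a, v):
    """Multiply matrix a (list of rows) by column vector v (list)."""
    return [sum(row[k] * v[k] for k in range(len(v))) for row in a]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

# Hypothetical toy sizes: N = 3 pixels, m = 2 retained eigenvectors.
P_m_T = [[1.0, 0.0, 0.0],
         [0.0, 1.0, 0.0]]      # P_m^T, m x N
W = [[1.0, -1.0],
     [0.0, 2.0]]               # m x m unmixing matrix (made up)

U = matmul(W, P_m_T)           # U = W P_m^T, computed once after training

def ica_features(image):
    """ICA representation c = U I of one image given as a length-N list."""
    return matvec(U, image)

image = [3.0, 4.0, 5.0]
c = ica_features(image)

# The same result is obtained in two stages: reduce the dimension, then unmix.
two_stage = matvec(W, matvec(P_m_T, image))
print(c, two_stage)
```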

Figure 4.8 shows a set of learned basis images when the dimensionality, m, is 80.


Figure 4.8: The ICA basis images obtained from 80 eigenvectors by Method 2


4.4.2  SVM  Classification  of  ICA  Features 

After obtaining ICA features, we build the SVM training set $\{(c_i, d_i)\}_{i=1}^{l}$, where $d_i$ is the class
type of feature $c_i$. For the face detection problem, $d_i = 1$ when $c_i$ is extracted from a face
image, and $d_i = -1$ when $c_i$ is from a non-face image; l is the size of the training set. SVM for
classification has been discussed in detail in Chapter 3. Here we mention another point about
SVM. For a large training data set, the SVM training procedure is time consuming. How to
speed up SVM training is an important research topic in the SVM research community. But
different training strategies should only affect the training speed, not the learning ability of
SVM. Here we used the method described in [49] for training.
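For illustration, the sketch below shows how a trained kernel SVM would classify one ICA feature vector using the standard decision function sign(sum_i alpha_i d_i K(c_i, c) + b). The support vectors, multipliers, bias, and the choice of a Gaussian (RBF) kernel with gamma = 0.5 are all made-up assumptions for the example; they are not the values or the kernel used in our experiments, and no training is performed here.

```python
import math

def rbf_kernel(u, v, gamma=0.5):
    """Gaussian kernel K(u, v) = exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def svm_decide(c, support_vectors, labels, alphas, bias, kernel=rbf_kernel):
    """Return +1 (face) or -1 (non-face) for the feature vector c."""
    score = sum(a * d * kernel(sv, c)
                for sv, d, a in zip(support_vectors, labels, alphas)) + bias
    return 1 if score >= 0 else -1

# Hypothetical trained model with one support vector per class.
support_vectors = [[1.0, 1.0], [-1.0, -1.0]]   # ICA features c_i
labels = [1, -1]                               # class types d_i
alphas = [1.0, 1.0]                            # Lagrange multipliers
bias = 0.0

print(svm_decide([0.9, 1.2], support_vectors, labels, alphas, bias))    # near the face vector
print(svm_decide([-1.0, -0.5], support_vectors, labels, alphas, bias))  # near the non-face vector
```

Only the support vectors enter the sum, which is where the sparsity of the SVM solution shows up at classification time.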

4.5  The  Hybrid  ICA/SVM  based  Face  Detection  System 

In this section we describe how to build a complete face detection system based on the hybrid
ICA/SVM learning scheme. The detection system includes training and testing parts. The
training part consists of the following steps:

1.  In a training set for face detection, face and non-face patterns are assigned to 1 and -1
respectively. Each of these patterns has 20 × 20 pixels. The face patterns include faces
with different facial expressions and under different views; see Figures 4.9 and 4.10 for
some examples.

2.  The data is preprocessed to compensate for variations in the training patterns:

•  In order to reduce background noise, pixels close to the boundary of each rectangular
training pattern are removed by a binary mask.

•  Histogram equalization is then performed to compensate for illumination
differences, etc.

3.  After preprocessing, the ICA algorithm is applied to the data to learn the independent
image bases which are used for feature extraction. Since two different ICA feature extraction
methods can be applied, we can obtain two different sets of image bases and features.

4.  Using the ICA features, the SVM is trained to construct a decision plane in a
high-dimensional space. Since it is difficult to find a good representative set of non-face
patterns, a bootstrapping technique is used to add misclassified non-face patterns into the
training set, and then the SVM is re-trained to get a better decision plane.

Figures 4.9, 4.10, 4.11, and 4.12 show sets of image blocks after histogram equalization.
They are used by ICA during the training procedure and include face patterns with different
facial expressions and under slightly different views, non-face patterns used in initial training,
and non-face patterns that are first misclassified and then used for bootstrapping.
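The bootstrapping step of the training procedure can be sketched as a simple loop. This is an illustrative sketch, not our training code: `bootstrap_train`, the `train` callback, and the one-dimensional threshold "classifier" `toy_train` are hypothetical stand-ins for the SVM trained on ICA features, and the stopping rule (no new mistakes, or a fixed number of rounds) is an assumption.

```python
def bootstrap_train(train, faces, nonfaces, nonface_pool, max_rounds=5):
    """Retrain while adding misclassified non-face patterns to the training set.

    train(faces, nonfaces) must return a classifier f(x) -> +1 (face) or -1.
    """
    negatives = list(nonfaces)
    classify = train(faces, negatives)
    for _ in range(max_rounds):
        # Non-face patterns that the current classifier wrongly accepts as faces.
        mistakes = [x for x in nonface_pool
                    if classify(x) == 1 and x not in negatives]
        if not mistakes:
            break
        negatives.extend(mistakes)
        classify = train(faces, negatives)
    return classify, negatives

def toy_train(faces, nonfaces):
    """Stand-in trainer on 1-D features: threshold halfway between the classes."""
    threshold = (min(faces) + max(nonfaces)) / 2.0
    return lambda x: 1 if x > threshold else -1

classify, negatives = bootstrap_train(toy_train, [5, 6, 7], [0, 1], [2, 3, 4, 1])
print(negatives)       # the initially misclassified pattern 4 has been added
print(classify(4))     # now rejected as a non-face
```

In this toy run the first classifier accepts the non-face pattern 4; after it is added to the training set and the classifier is retrained, the decision boundary moves and the pattern is rejected.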

The  testing  part  comprises  the  following  steps: 

1.  A skin color filter is used to find a binary mask which locates the face candidate regions
in a test image.

2.  The test image is rescaled several times, because we do not have prior knowledge about
the face size.

3.  A 20 × 20 window is moved over the face candidate regions to select image blocks for
detection.

4.  ICA features are extracted from the image blocks, using the pre-stored image bases which
are obtained during the training procedure. Note that we have two different schemes for
feature extraction using two different sets of image bases.

Figure 4.9: Face patterns with different facial expressions used in training

Figure 4.10: Face patterns under slightly different views used in training

Figure 4.11: Non-face patterns used in initial training

Figure 4.12: Non-face patterns used for bootstrapping

5.  The trained SVM classifies the ICA features.


6.  Post-processing  is  performed  to  enhance  system  performance: 

•  If  a  detection  appears  at  only  one  scale,  it  is  usually  a  false  detection.  By  ANDing 
the  detection  locations  at  different  scales,  we  can  effectively  reduce  the  number  of 
false  detections. 

•  The  sliding  window  method  usually  leads  to  several  detections  near  a  face  region. 
Thresholding  the  number  of  detections  in  a  neighborhood  tends  to  keep  correct 
detections  and  eliminate  false  detections. 

•  If  a  detection  is  correct,  the  detections  that  overlap  the  correct  one  are  usually  false. 
So  after  the  previous  two  steps,  the  detection  location  with  the  largest  number  of 
detections  within  a  neighborhood  is  assumed  to  be  correct  and  preserved,  while  the 
other  locations  with  fewer  detections  are  eliminated. 

7.  The  system  takes  the  output  of  the  post-processing  as  the  final  detection  result. 
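The window selection in step 3 and the post-processing in step 6 can be sketched as follows. This is a simplified illustration under several assumptions that are not stated above: a window is scanned only if its center pixel lies on the candidate mask, detections are (row, column) positions already mapped back to the original image, "nearby" means within `radius` pixels in both coordinates, and ties between equally supported locations are broken arbitrarily. All function names are hypothetical.

```python
def candidate_windows(mask, size=20, step=1):
    """Yield top-left (row, col) of windows whose center lies on the mask."""
    rows, cols = len(mask), len(mask[0])
    for r in range(0, rows - size + 1, step):
        for c in range(0, cols - size + 1, step):
            if mask[r + size // 2][c + size // 2]:
                yield (r, c)

def count_near(det, detections, radius):
    """Number of entries of `detections` within `radius` of `det`."""
    return sum(1 for d in detections
               if abs(d[0] - det[0]) <= radius and abs(d[1] - det[1]) <= radius)

def and_across_scales(dets_by_scale, radius):
    """Keep a detection only if some other scale has a detection nearby."""
    kept = []
    for i, dets in enumerate(dets_by_scale):
        others = [d for j, ds in enumerate(dets_by_scale) if j != i for d in ds]
        kept.extend(d for d in dets if count_near(d, others, radius) > 0)
    return kept

def suppress_overlaps(detections, radius, min_count=2):
    """Threshold the neighborhood counts, then keep the best-supported locations."""
    scored = [(count_near(d, detections, radius), d) for d in detections]
    scored = sorted((s for s in scored if s[0] >= min_count), reverse=True)
    final = []
    for _, d in scored:
        # Drop a location if it overlaps one that has already been kept.
        if all(abs(d[0] - f[0]) > radius or abs(d[1] - f[1]) > radius
               for f in final):
            final.append(d)
    return final

dets_by_scale = [[(10, 10), (50, 50)], [(11, 10)], [(10, 11)]]
kept = and_across_scales(dets_by_scale, radius=2)
final = suppress_overlaps(kept, radius=2)
print(kept)    # (50, 50) appears at one scale only and is dropped
print(final)   # a single location remains for the cluster near (10, 10)
```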

4.6  Experimental  Results 

To evaluate the hybrid learning scheme on the face detection problem, we tested the system on
820 face examples from the LAMP face database, which we developed ourselves, and from the Essex
facial image database [23], as well as on 100,884 non-face image blocks which we obtained from
the LAMP face database and the web. In the LAMP face database, the face examples were
recorded from TV shows. In the Essex facial image database, face examples have expression
changes and position changes.

The results are reported in Table 4.1. From this table, we see that using ICA feature
extraction Method 1, the hybrid learning scheme effectively improves the classification accuracy
compared to the SVM detection system without ICA feature extraction: the number of false
detections drops from 252 to 54, and the number of missed detections from 41 to 39. Several face
detections on the test examples are shown in Figures 4.13 through 4.18.

Detection System                                         Number of            Number of
                                                         Missed Detections    False Detections
Hybrid ICA/SVM detection system based on ICA Method 1    39                   54
Hybrid ICA/SVM detection system based on ICA Method 2    45                   1743
SVM detection system without ICA feature extraction      41                   252

Table 4.1: Face detection results

On the other hand, using ICA feature extraction Method 2 leads to deterioration of
performance in classifying non-face examples. A possible reason is the dimensionality of the
features, which is 80 (reduced from 400) in our experiment, and may be too small to represent
the original signals in Method 2.
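For easier comparison, the counts in Table 4.1 can be converted into rates. The short calculation below assumes that missed detections are counted over the 820 face examples and false detections over the 100,884 non-face image blocks; the dictionary keys are labels chosen for this example.

```python
FACES = 820          # face examples in the test set
NONFACES = 100884    # non-face image blocks in the test set

# (missed detections, false detections) taken from Table 4.1
results = {
    "ICA Method 1 + SVM": (39, 54),
    "ICA Method 2 + SVM": (45, 1743),
    "SVM only": (41, 252),
}

def rates(missed, false_detections):
    """Return (miss rate, false detection rate) in percent."""
    return 100.0 * missed / FACES, 100.0 * false_detections / NONFACES

for name, (missed, false_detections) in results.items():
    miss_rate, false_rate = rates(missed, false_detections)
    print(f"{name}: miss rate {miss_rate:.2f}%, false detection rate {false_rate:.4f}%")
```

Under this assumption the miss rates of the three systems are all close to 5%, so the systems differ mainly in their false detection rates.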
Figure 4.13: Face detection example 1: at Scale 1

Figure 4.14: Face detection example 1: at Scale 2

Figure 4.15: Face detection example 1: at Scale 3

Figure 4.16: Face detection example 2: Final result

Figure 4.17: Face detection example 3: Final result

Figure 4.18: Face detection example 4: Final result

4.7  Conclusion  and  Discussion


In this chapter, we have presented a new hybrid supervised/unsupervised learning scheme that
integrates ICA and SVM to address pattern recognition problems.

In low-level feature extraction, ICA finds independent bases or coefficients to represent data.
From Figures 4.6, 4.7 and 4.8, we see that the ICA bases emphasize edge information in the
image data, as argued in [9]. In addition, because ICA tries to make bases or representative
coefficients independent of each other, the ICA features represent the data better than the PCA
features when the training data are not orthogonal to each other in the probability sense, for
example, face image data with different facial expressions and in different views. In high-level
feature classification, as an approximate implementation of the SRM Principle, the SVM tends
to give good generalization performance. Many applications of SVM support this point.
A common characteristic of ICA and SVM is sparsity. The ICA output is sparse. As shown
in [48], ICA is formally equivalent to sparse coding under some conditions. The support vectors
whose linear combination comprises the trained SVM are also sparse. In [29], Girosi proves an
equivalence between SVM and a Sparse Approximation (SA) scheme under the noise-free condition.
Thus combining ICA and SVM yields a hierarchical sparse learning scheme. Experimental
results on the face detection problem show that the hybrid ICA/SVM learning scheme effectively
improves detection system performance, compared with applying SVM directly to the original
image data.

An idea that should improve this hybrid learning scheme is integrating SVM with
subband-based ICA, which was proposed in Chapter 2. By applying ICA in the time-frequency plane,
subband-based ICA successfully separates mixed acoustic signals, such as speech and music
signals, even in the presence of strong noise or when performed on-line. Though here the aim
is not to separate mixed acoustic signals, subband-based ICA is expected to find better signal
representations, because it leads to a more sparse output than classical ICA algorithms and
is more robust against noise. Naturally, subband-based ICA can be extended to multi-scale
feature extraction. Since ICA bases are wavelet-like when trained on natural image data, an
interesting similarity between subband-based ICA feature extraction and human auditory (or
visual) processing is that both of them have a two-layered wavelet-like structure.

In addition, inspired by the use of the kernel trick in SVM, we hope to construct a
kernel-based ICA algorithm. The idea is to apply ICA to the high-dimensional nonlinear mapped
space instead of the original signal space. Kernel-based ICA is expected to have applications
to more efficient feature extraction and separation of nonlinear mixed signals.

Finally, we would like to point out that the hybrid ICA/SVM scheme is a general learning
scheme, which can be applied to problems other than face detection, such as speech processing
and data mining.

Chapter  5


Conclusions 


Machine learning algorithms play increasingly important roles in many areas, such as
pattern recognition, signal processing, and communications. In this thesis, we have proposed two
machine learning schemes, the Subband-based Independent Component Analysis scheme and the
hybrid Independent Component Analysis/Support Vector Machine scheme, and applied them
to the problems of blind acoustic signal separation and face detection.

Inspired by our understanding of the subbanding strategies used in the early auditory
system, we have proposed subband-based ICA, a new powerful learning algorithm, to solve the
blind source separation (BSS) problem (Chapter 2). Though classical ICA algorithms have been
applied to address the BSS problem, they do not work well in the presence of noise or when
performed on-line. By performing separation in several frequency bands which contain most of the
energy in the mixture, the new subband-based ICA approach is robust against noise and
converges to the real demixing matrix quickly, even in its on-line version. The experimental results,
as shown in Figures 2.3, 2.4, and 2.5, demonstrate its success where other ICA algorithms fail.
The virtually increased signal-to-noise ratio in those frequency bands, the fact that subband
signals, i.e., wavelet coefficients, have more peaky and heavy-tailed distributions than the original
signals, and the adaptation to the properties of the signal and noise by the incorporation of a
best basis selection algorithm, all contribute to the success of subband-based ICA.

Subband-based ICA is also a computationally efficient algorithm because it reduces
computational complexity by performing separation on the down-sampled signals in several or even
a single frequency band. Its speed is much higher than those of previous ICA algorithms, as
shown in Tables 2.2 and 2.3.

We can further generalize subband-based ICA by replacing the subband decomposition
with some appropriate projection. For example, a nonlinear projection can be used under some
criterion, e.g., maximum likelihood, to derive a nonlinear ICA.

Our future work on the blind separation problem will include using some signal cues, for
example, the pitches of acoustic signals, and available prior knowledge to guide separation.
In this way, we may increase convergence speed and accomplish the separation even in cases
where the number of sensors is less than the number of sources. Some work has been initiated
in this direction.

Snbband  based  ICA  is,  in  essence,  an  nnsnpervised  learning  scheme.  In  Chapter  3,  a  snper- 
vised  learning  algorithm,  the  Snpport  Vector  Machine  (SVM),  is  presented.  As  an  approximate 
implementation  of  the  Strnctnral  Risk  Minimization  (SRM)  Principle  that  is  proposed  in  statis¬ 
tical  learning  theory,  SVM  provides  a  method  of  minimizing  the  snm  of  the  nnmber  of  training 
errors  and  the  VC  dimension,  which  indicates  the  model  complexity,  so  that  high  generahzation 
performance  can  be  achieved. 

In addition to high generalization performance, SVM can control model complexity independently
of the dimensionality of the feature space, by reformulating the primal optimization
problem into its dual problem and using an inner-product kernel trick. In fact, an
infinite-dimensional feature space is allowed in SVM. Moreover, the convex cost function in the
QP problem guarantees that SVM will find a globally optimal solution that automatically tunes
all the parameters in the learning scheme, while many other learning algorithms suffer from
falling into local extrema.
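
The kernel trick can be illustrated with a standard textbook case: for two-dimensional inputs, the homogeneous polynomial kernel k(x, z) = (x . z)^2 equals an ordinary inner product after the explicit map phi(x) = (x1^2, sqrt(2) x1 x2, x2^2). The following is a minimal sketch of that identity; the function names and sample vectors are illustrative assumptions.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 homogeneous polynomial kernel in 2-D."""
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

def poly2_kernel(x, z):
    """k(x, z) = (x . z)^2, evaluated in input space without forming phi."""
    return float(np.dot(x, z)) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = float(np.dot(phi(x), phi(z)))   # inner product in feature space
implicit = poly2_kernel(x, z)              # same value computed in input space
print(explicit, implicit)
```

Because the dual problem touches the data only through such inner products, the cost of training depends on the number of examples rather than on the dimension of the feature space. The Gaussian radial basis kernel is a common example whose feature space is infinite-dimensional.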

Though originally derived from the SRM Principle to address the problem of the tradeoff
between model complexity and generalization ability, SVM is closely related to other known
techniques and research problems:

• The support vectors are usually sparse. They constitute only a fraction of the total number
of examples in the training set. Using the reproducing property of the Reproducing
Kernel Hilbert Space (RKHS), Girosi [29] shows an equivalence between SVMs in the
noiseless case and a Sparse Approximation (SA) scheme that resembles the Basis Pursuit
De-Noising algorithm [14].

• Also in [29], Girosi gives a derivation of the SVM algorithm in the framework of regularization
theory. In [24], Evgeniou et al. give a unified framework for regularization networks
and SVM. The reformulation of SVM in regularization theory reveals the connection between
SVM and other known techniques. However, it hides the relation between SVM
and the SRM Principle.
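
In the regularization view, SVM classification can be sketched as the minimization of a loss term plus a smoothness penalty in an RKHS, following the general form used in [24]. The symbols below are notation assumed here: V is the loss, K the kernel with RKHS \mathcal{H}_K, \lambda the regularization parameter, and (x_i, y_i), i = 1, ..., l, the training pairs with y_i = ±1.

```latex
\min_{f \in \mathcal{H}_K} \;
  \frac{1}{l} \sum_{i=1}^{l} V\bigl(y_i, f(\mathbf{x}_i)\bigr)
  + \lambda \, \|f\|_K^2 ,
\qquad
V\bigl(y, f(\mathbf{x})\bigr) = \max\bigl(0,\; 1 - y f(\mathbf{x})\bigr)
```

Replacing the hinge loss by the squared loss (y - f(x))^2 gives a regularization network, which is the sense in which the two methods share one framework.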

SVM provides high generalization performance without incorporating any prior knowledge
about the problem. An important research topic is how to incorporate problem-domain knowledge
into SVM to further improve its performance. Some proposed approaches include adding
a term that represents prior knowledge to the cost function, using prior knowledge
to design the kernel function [59], and adding virtual examples to the training set [58]. More
efficient and natural ways of adding prior knowledge to SVM are yet to be developed. For
example, integrating Bayesian learning theory into SVM might be a good way of exploiting
prior information.
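
The virtual-example approach enlarges the training set with transformed copies of existing examples that keep their labels. Below is a minimal sketch, assuming that one-pixel translations leave the class unchanged. The function name, the set of shifts, the 19x19 window size, and the wrap-around behavior of `np.roll` are illustrative choices, not details taken from [58].

```python
import numpy as np

def add_virtual_examples(images, labels,
                         shifts=((0, 1), (0, -1), (1, 0), (-1, 0))):
    """Return the training set augmented with translated copies.

    images: array of shape (n, height, width); labels: array of shape (n,).
    Each (dy, dx) in `shifts` adds one shifted copy of every image, labelled
    like the original. np.roll wraps pixels around the border, which is a
    simplification of a true translation.
    """
    images = np.asarray(images)
    labels = np.asarray(labels)
    aug_images, aug_labels = [images], [labels]
    for dy, dx in shifts:
        aug_images.append(np.roll(images, shift=(dy, dx), axis=(1, 2)))
        aug_labels.append(labels)
    return np.concatenate(aug_images), np.concatenate(aug_labels)

rng = np.random.default_rng(0)
faces = rng.random((10, 19, 19))              # hypothetical training windows
targets = np.where(np.arange(10) < 5, 1, -1)  # hypothetical labels
X_aug, y_aug = add_virtual_examples(faces, targets)
print(X_aug.shape, y_aug.shape)
```

With four shifts, the augmented set is five times the size of the original, so the invariance is bought at the price of a larger QP problem.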

Another research topic related to SVM is that the kernel trick in SVM can also be used in
other algorithms that are based on the inner product of the data. For example, Principal
Component Analysis can be done in a high-dimensional feature space by using a suitable nonlinear
kernel function [60]. Fisher discriminant analysis also uses a similar idea [45].
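
As an illustration of PCA carried out through inner products only, the sketch below centers a Gram matrix in feature space, takes its eigendecomposition, and returns the projections of the training points. It follows the usual textbook formulation of kernel PCA. The Gaussian kernel, the value of `gamma`, and the random data are assumptions made for the example.

```python
import numpy as np

def rbf_gram(X, gamma):
    """Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X * X, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def kernel_pca(K, n_components):
    """Project the training points onto the leading kernel principal components.

    K is an (n, n) Gram matrix. Only inner products are used, so the
    feature map itself is never formed.
    """
    n = K.shape[0]
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one    # center in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)         # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    lam, V = eigvals[order], eigvecs[:, order]
    return V * np.sqrt(np.maximum(lam, 0.0))      # scores, shape (n, n_components)

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 3))                  # hypothetical data
scores = kernel_pca(rbf_gram(X, gamma=0.5), n_components=2)
print(scores.shape)
```

Passing the linear Gram matrix `X @ X.T` instead of `rbf_gram(X, ...)` reproduces ordinary PCA scores up to sign, which is a convenient sanity check.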

Finally, in Chapter 4, we have presented a new hybrid supervised/unsupervised learning
scheme that integrates ICA and SVM to address pattern recognition problems.

In low-level feature extraction, ICA finds independent bases or coefficients to represent data.
From Figures 4.6, 4.7 and 4.8, we can see that the ICA bases emphasize edge information in the
image data, as argued in [9]. In addition, because ICA tries to make data bases or representation
coefficients independent of each other, the ICA features represent the data better than the PCA
features when the training data are not orthogonal to each other in a probability sense, for
example, face image data with different facial expressions and seen in different views. In
high-level feature classification, as an approximate implementation of the SRM Principle, SVM
tends to have good generalization performance. Many applications of SVM have borne this out.
One common characteristic shared by ICA and SVM is sparseness. The ICA output is sparse.
As shown in [48], ICA is formally equivalent to sparse coding under some conditions. The
support vectors whose linear combination comprises the trained SVM are also sparse. Thus
combining ICA and SVM yields a hierarchical sparse learning scheme. Experimental results on
the face detection problem show that the hybrid ICA/SVM learning scheme effectively improves
detection system performance, compared with applying SVM directly to the original image data.
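
The two-stage structure can be sketched on synthetic data as follows. This is a toy illustration under stated assumptions, not the face detector of Chapter 4: stage 1 uses a symmetric fixed-point ICA with a tanh nonlinearity in the spirit of [34], stage 2 trains a linear SVM by subgradient descent on the hinge loss as a stand-in for a QP solver, and the data, labels, and all parameter values are hypothetical.

```python
import numpy as np

def whiten(X):
    """Center X (n_samples, n_features) and rescale it to identity covariance."""
    Xc = X - X.mean(axis=0)
    eigvals, E = np.linalg.eigh(Xc.T @ Xc / len(Xc))
    return Xc @ (E @ np.diag(eigvals ** -0.5) @ E.T)

def sym_decorrelate(W):
    """Return (W W^T)^(-1/2) W, which has orthonormal rows."""
    eigvals, E = np.linalg.eigh(W @ W.T)
    return E @ np.diag(eigvals ** -0.5) @ E.T @ W

def fastica(Z, n_iter=200, seed=0):
    """Symmetric fixed-point ICA with a tanh nonlinearity on whitened data Z."""
    n = Z.shape[1]
    W = sym_decorrelate(np.random.default_rng(seed).standard_normal((n, n)))
    for _ in range(n_iter):
        G = np.tanh(Z @ W.T)                        # g(w_i^T z) for every sample
        W_new = G.T @ Z / len(Z) - np.diag((1.0 - G ** 2).mean(axis=0)) @ W
        W = sym_decorrelate(W_new)
    return W

def train_linear_svm(F, y, lam=0.01, lr=0.1, n_iter=300):
    """Batch subgradient descent on the regularized hinge loss (no bias term)."""
    w = np.zeros(F.shape[1])
    for _ in range(n_iter):
        active = y * (F @ w) < 1.0                  # margin violators
        grad = lam * w - F[active].T @ y[active] / len(F)
        w -= lr * grad
    return w

rng = np.random.default_rng(1)
n = 400
y = np.where(np.arange(n) % 2 == 0, 1.0, -1.0)      # balanced hypothetical labels
S = rng.laplace(size=(n, 4))                        # sparse latent components
S[:, 0] += 4.0 * y                                  # one component carries the class
X = S @ rng.standard_normal((4, 4)).T               # observed, linearly mixed data

Z = whiten(X)
W = fastica(Z)
F = Z @ W.T                                         # stage 1: ICA features
w = train_linear_svm(F, y)                          # stage 2: SVM on the features
accuracy = float(np.mean(np.sign(F @ w) == y))
print(accuracy)
```

The classifier never sees the raw data, only the ICA coefficients, which mirrors the division of labor between low-level feature extraction and high-level classification described above.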

One idea that should improve this hybrid learning scheme is integrating SVM with
subband-based ICA, which was proposed in Chapter 2. By applying ICA in the time-frequency plane,
subband-based ICA successfully separates mixed acoustic signals, such as speech and music
signals, even in the presence of strong noise or when performed on-line. Though here the aim
is not to separate mixed acoustic signals, subband-based ICA is expected to find better signal
representations, because it leads to a sparser output than classical ICA algorithms and
is more robust against noise. Naturally, subband-based ICA can be extended to multi-scale
feature extraction. Since ICA bases are wavelet-like when trained on natural image data, an
interesting similarity between subband-based ICA feature extraction and human auditory (or
visual) processing is that both of them have a two-layered wavelet-like structure.

In addition, inspired by the use of the kernel trick in SVM, we hope to construct a
kernel-based ICA algorithm. The idea is to apply ICA to the high-dimensional nonlinear mapped
space instead of the original signal space. Kernel-based ICA is expected to have applications
to more efficient feature extraction and separation of nonlinear mixed signals.

Finally, we would like to point out that the hybrid ICA/SVM learning scheme is a general
scheme, which can be applied to problems other than face detection, for example, speech
processing or data mining.




Bibliography 


[1] J. B. Allen. How do humans process and recognize speech? IEEE Trans. on Speech and
Audio Processing, 2:567-577, 1994.

[2]  S.  Amari,  T.  P.  Chen,  and  A.  Cichocki.  Non-holonomic  orthogonal  learning  algorithm  for 
blind  source  separation. 

[3]  S.  Amari  and  A.  Cichocki.  Adaptive  blind  signal  processing — neural  network  approaches. 
Proceedings  of  the  IEEE,  86:2026-48,  1998. 

[4] S. Amari, A. Cichocki, and H. H. Yang. A new learning algorithm for blind signal separation.
In D. Touretzky, M. Mozer, and M. Hasselmo, editors, Advances in Neural Information
Processing Systems, volume 8, pages 752-763. MIT Press, Cambridge, MA, 1996.

[5] J. J. Atick. Entropy minimization: A design principle for sensory perception? International
Journal of Neural Systems, 3:81-90, 1992.

[6]  H.  B.  Barlow.  Unsupervised  learning.  Neural  Computation,  1:295-311,  1989. 

[7] H. B. Barlow. What is the computational goal of the neocortex? In Large-scale Neuronal
Theories of the Brain. MIT Press, Cambridge, MA, 1994.

[8]  M.  S.  Bartlett,  H.  M.  Lades,  and  T.  J.  Sejnowski.  Independent  component  representations 
for  face  recognition.  In  Proc.  SPIE  Conf.  on  Human  Vision  and  Electronic  Imaging  III, 
volume  3299,  pages  528-539,  1998. 

[9] A. J. Bell and T. J. Sejnowski. Edges are the independent components of natural scenes.
In M. C. Mozer, M. I. Jordan, and T. Petsche, editors, Advances in Neural Information
Processing Systems, volume 9, page 831. MIT Press, Cambridge, MA, 1997.

[10] V. Blanz, B. Schölkopf, H. Bülthoff, C. Burges, V. Vapnik, and T. Vetter. Comparison
of view-based object recognition algorithms using realistic 3D models. In C. von der
Malsburg, W. von Seelen, J. C. Vorbrüggen, and B. Sendhoff, editors, Artificial Neural
Networks - ICANN’96, pages 251-256, Berlin, 1996. Springer Lecture Notes in Computer
Science, Vol. 1112.

[11] C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data
Mining and Knowledge Discovery, 2(2), 1998.

[12]  J.-F.  Cardoso.  Blind  signal  separation:  statistical  principles.  Proceedings  of  the  IEEE, 
86:2009-2025,  1998. 

[13]  R.  Chellappa,  C.  L.  Wilson,  S.  Sirohey,  and  C.  S.  Barnes.  Human  and  machine  recognition 
of  faces:  A  survey.  Technical  Report  CS-TR-3339,  Department  of  Computer  Science, 
University  of  Maryland,  College  Park,  August  1994. 

[14]  S.  S.  B.  Chen,  D.  L.  Donoho,  and  M.  A.  Saunders.  Atomic  decomposition  by  basis  pursuit. 
Technical  Report  409,  Department  of  Statistics,  Stanford  University,  1995. 

[15] R. Coifman and M. V. Wickerhauser. Entropy-based algorithms for best basis selection.
IEEE Trans. on Information Theory, 38:713-719, 1992.




[16]  A.  J.  Colmenarez  and  T.  S.  Huang.  Face  detection  with  information  based  maximum 
discrimination.  In  Proc.  CVPR,  pages  782-787,  1997. 

[17] T. M. Cover. Geometrical and statistical properties of systems and linear inequalities with
applications in pattern recognition. IEEE Trans. on Electronic Computers, 19:326-334,
1965.

[18]  Y.  Dai,  Y.  Nakano,  and  H.  Miyao.  Extraction  of  facial  images  from  a  complex  background 
using  SGLD  matrices.  In  Proc.  ICPR,  pages  A:137-141,  1994. 

[19]  I.  Daubechies.  Ten  Lectures  on  Wavelets.  SIAM,  1992. 

[20] G. Deco and D. Obradovic. Linear redundancy reduction learning. Neural Networks,
8:751-755, 1995.

[21] G. L. Donato, M. S. Bartlett, J. C. Hager, P. Ekman, and T. J. Sejnowski. Classifying
facial actions. IEEE Trans. on Pattern Analysis and Machine Intelligence, 21:974-989,
1999.

[22] D. L. Donoho. De-noising by soft-thresholding. IEEE Trans. on Information Theory,
41:613-627, 1995.

[23] The Essex facial image database. http://cswww.essex.ac.uk/mv/projects.html.

[24]  T.  Evgeniou,  M.  Pontil,  and  T.  Poggio.  A  unified  framework  for  regularization  networks  and 
support  vector  machines.  Technical  Memo  AIM-1654,  Artificial  Intelligence  Laboratory, 
Massachusetts  Institute  of  Technology,  December  1999. 

[25]  D.  J.  Field.  What  is  the  goal  of  sensory  coding?  Neural  Computation,  6:559-601,  1994. 

[26]  M.  M.  Fleck,  D.  A.  Forsyth,  and  C.  Bregler.  Finding  naked  people.  In  Proc.  4th  European 
Conference  on  Computer  Vision,  1996. 

[27]  R.  Fletcher.  Practical  Methods  of  Optimization,  2nd  ed.  Wiley,  New  York,  1987. 

[28]  Z.  Ghahramani  and  G.  E.  Hinton.  The  EM  algorithm  for  mixtures  of  factor  analyzers. 
Technical  Report  CRG-TR-96-1,  University  of  Toronto,  1996. 

[29]  F.  Girosi.  An  equivalence  between  sparse  approximation  and  support  vector  machines. 
Neural  Computation,  10:1455-1480,  1998. 

[30] H. P. Graf, E. Cosatto, D. Gibbon, M. Kocheisen, and E. Petajan. Multimodal system
for locating heads and faces. In Proc. IEEE Int’l. Conf. on Automatic Face and Gesture
Recognition, pages 88-93, 1996.

[31] S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall, Upper Saddle
River, NJ, 2nd edition, 1999.

[32]  P.  J.  Huber.  Projection  pursuit.  Annals  of  Statistics,  13:435-475,  1985. 

[33] http://www.cis.hut.fi/projects/ica/fastica/.

[34] A. Hyvärinen. Fast and robust fixed-point algorithms for independent component analysis.
IEEE Trans. on Neural Networks, 10:626-634, 1999.




[35] A. Hyvärinen. Survey on independent component analysis. Neural Computing Surveys,
2:94-128, 1999.

[36] A. Hyvärinen, P. Hoyer, and E. Oja. Sparse code shrinkage: Denoising by nonlinear
maximum likelihood estimation. In Advances in Neural Information Processing Systems,
volume 11, 1999.

[37] A. Hyvärinen and P. Pajunen. Nonlinear independent component analysis: Existence and
uniqueness results. Neural Networks, 12:429-439, 1999.

[38]  http://sound.media.mit.edu/ica-bench/sources/. 

[39] T.-W. Lee, M. Girolami, and T. J. Sejnowski. Independent component analysis using an
extended infomax algorithm for mixed sub-Gaussian and super-Gaussian sources. Neural
Computation, 11:409-433, 1999.

[40]  T.-W.  Lee,  B.  U.  Koehler,  and  R.  Orglmeister.  Blind  source  separation  of  nonlinear  mixing 
models.  In  Proc.  IEEE  Int’l.  Workshop  on  Neural  Networks  for  Signal  Processing,  pages 
406-415,  1997. 

[41]  http://www.cnl.salk.edu/  tewon/. 

[42]  C.  Liu  and  H.  Wechsler.  Comparative  assessment  of  independent  component  analysis  (ICA) 
for  face  recognition.  In  Proc.  Int’l.  Conf.  on  Audio-  and  Video-based  Biometric  Person 
Authentication,  1999. 

[43]  R.  Lyon  and  S.  Shamma.  Auditory  representations  of  timbre  and  pitch.  In  Auditory 
Computation,  pages  221-270.  Springer,  Berlin,  1995. 

[44]  J.  Mercer.  Functions  of  positive  and  negative  type,  and  their  connection  with  the  theory 
of  integral  equations.  Transactions  of  the  London  Philosophical  Society  (A),  209:415-446, 
1909. 

[45] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K.-R. Müller. Fisher discriminant analysis
with kernels. In Y.-H. Hu, J. Larsen, E. Wilson, and S. Douglas, editors, Neural Networks
for Signal Processing IX, pages 41-48. IEEE, 1999.

[46] B. Moghaddam and A. P. Pentland. Probabilistic visual learning for object recognition.
IEEE Trans. on Pattern Analysis and Machine Intelligence, 19:696-710, 1997.

[47] B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive field properties by
learning a sparse code for natural images. Nature, 381:607-609, 1996.

[48] B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy
employed by V1? Vision Research, 37:3311-3325, 1997.

[49]  E.  Osuna,  R.  Freund,  and  F.  Girosi.  Training  support  vector  machines:  An  application  to 
face  detection.  In  Proc.  CVPR,  pages  130-136,  1997. 

[50]  R.  J.  Qian  and  T.  S.  Huang.  Object  detection  using  hierarchical  MRF  and  MAP  estimation. 
In  Proc.  CVPR,  pages  186-192,  1997. 




[51]  A.  N.  Rajagopalan,  K.  S.  Kumar,  J.  Karlekar,  R.  Manivasakan,  M.  M.  Patil,  U.  B.  Desai, 
P.  G.  Poonacha,  and  S.  Chaudhuri.  Finding  faces  in  photographs.  In  Proc.  ICCV98,  pages 
640-645,  1998. 

[52] R. A. Redner and H. F. Walker. Mixture densities, maximum likelihood and the EM
algorithm. SIAM Review, 26:195-239, 1984.

[53]  D.  Roth,  M.  Yang,  and  N.  Ahuja.  A  SNoW-based  face  detector,  2000. 

[54] H. A. Rowley, S. Baluja, and T. Kanade. Neural network-based face detection. IEEE
Trans. on Pattern Analysis and Machine Intelligence, 20:23-38, 1998.

[55]  H.  A.  Rowley,  S.  Baluja,  and  T.  Kanade.  Rotation  invariant  neural  network-based  face 
detection.  In  Proc.  CVPR,  page  963,  1998. 

[56] F. S. Samaria. Face Detection Using Hidden Markov Models. PhD thesis, University of
Cambridge, 1994.

[57]  M.  Schmidt.  Identifying  speaker  with  support  vector  machine.  In  Interface’96,  Sydney, 
1996. 

[58] B. Schölkopf, C. Burges, and V. Vapnik. Incorporating invariances in support vector learning
machines. In C. von der Malsburg, W. von Seelen, J. C. Vorbrüggen, and B. Sendhoff,
editors, Artificial Neural Networks, volume 1112 of Lecture Notes in Computer Science,
pages 47-52. Springer, Berlin, 1996.

[59] B. Schölkopf, P. Y. Simard, A. J. Smola, and V. N. Vapnik. Prior knowledge in support
vector kernels. In M. I. Jordan, M. J. Kearns, and S. A. Solla, editors, Advances in Neural
Information Processing Systems, volume 10, pages 640-646. MIT Press, Cambridge, MA,
1998.

[60] B. Schölkopf, A. Smola, and K.-R. Müller. Kernel principal component analysis. In
B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - SV
Learning, pages 327-352. MIT Press, Cambridge, MA, 1999.

[61] E. P. Simoncelli and E. H. Adelson. Noise removal via Bayesian wavelet coring. In IEEE
Int’l. Conf. on Image Processing, 1996.

[62] K. K. Sung and T. Poggio. Example-based learning for view-based human face detection.
IEEE Trans. on Pattern Analysis and Machine Intelligence, 20:39-51, 1998.

[63] J. H. van Hateren and A. van der Schaaf. Independent component filters of natural images
compared with simple cells in primary visual cortex. Proc. Royal Society, 265 B:359-366,
1998.

[64] V. Vapnik. Estimation of Dependences Based on Empirical Data (in Russian). Nauka,
Moscow, 1979. (English translation: Springer, New York, 1982).

[65]  V.  Vapnik.  The  Nature  of  Statistical  Learning  Theory.  Springer,  New  York,  1995. 

[66]  J.  Yang  and  A.  Waibel.  A  real-time  face  tracker.  In  Proc.  WACV,  1996. 

[67] M. H. Yang, N. Ahuja, and D. Kriegman. A survey on face detection methods, 1999.




[68] X. Yang, K. Wang, and S. Shamma. Auditory representations of acoustic signals. IEEE
Trans. on Information Theory, 38:824-839, 1992.

[69] A. L. Yuille, D. S. Cohen, and P. W. Hallinan. Feature extraction from faces using
deformable templates. IJCV, 8:99-111, 1992.




